# Advanced

## Classification Trees with Logistic Regression

Optimal Regression Trees can fit linear regression models in each leaf as part of the tree training process, which can drastically increase the power of the tree. This is not available for Optimal Classification Trees, as fitting logistic regression models during the training process is computationally infeasible.

However, there are a number of ways to produce classification trees with logistic regression in the leaves, which for certain problems can lead to improved results. We will demonstrate these approaches on the following synthetic dataset:

```julia
using DataFrames, Statistics, StableRNGs
rng = StableRNG(1)  # for consistent output across Julia versions
X = DataFrame(rand(rng, 200, 5), :auto)
betax = (((X.x1 .< 0.5) .* (X.x2 .+ X.x3)) .+
         ((X.x1 .> 0.5) .* (0.2 * X.x4 .+ X.x5)))
y = betax .> mean(betax)
(X_train, y_train), (X_test, y_test) =
    IAI.split_data(:classification, X, y, seed=1)
```

### 1. Refitting a trained classification tree

We start by training a normal Optimal Classification Tree:

```julia
grid = IAI.GridSearch(
    IAI.OptimalTreeClassifier(
        random_seed=1,
        minbucket=20,
    ),
    max_depth=1:3,
)
IAI.fit!(grid, X_train, y_train)
```

```julia
IAI.score(grid, X_test, y_test)
```

```
0.7833333333333333
```

We can see the results are not very good, likely because the underlying data is linear in nature while the tree can only predict a constant in each leaf. We can refit the leaves of the tree with `refit_leaves!`, using the `refit_learner` parameter to specify that the model in each leaf should be an L1-regularized logistic regression found with `glmnet` (via `GLMNetCVClassifier`):

```julia
lnr = IAI.get_learner(grid)
IAI.refit_leaves!(lnr, X_train, y_train,
    refit_learner=IAI.GLMNetCVClassifier(n_folds=2),
)
```

```julia
IAI.score(lnr, X_test, y_test)
```

```
0.9
```

We see the performance has improved as a result of adding some linearity to the leaf models.

A limitation of this approach is that the tree structure is unchanged by the refitting process, meaning the new tree may not be pruned optimally with respect to the logistic regression models in its leaves. We might expect that adding logistic regression models to the leaves would let a smaller tree reach the same performance, but simply refitting the leaves cannot achieve this, as neither the fitting nor the pruning of the tree was conducted with any knowledge of the final linearity of the leaf models.
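We can check this concretely: since `refit_leaves!` only replaces the models inside the leaves, the refitted tree has exactly the same number of nodes as the original. A quick sketch using the refitted `lnr` from above and `IAI.get_num_nodes` from the IAI tree API:

```julia
# The split structure is untouched by refitting: only the leaf models
# were replaced, so the node count matches the tree before refitting
IAI.get_num_nodes(lnr)
```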

### 2. Refitting a regression tree with linear predictions

One way to resolve the limitation of this first approach is to train the tree with some ability to fit linear predictions, as this might lead it to make splitting and pruning choices that are better suited to the linear nature of the final logistic regression models. We do this by first training a regression tree with linear predictions in the leaves, with our binary `y` as the target:

```julia
grid = IAI.GridSearch(
    IAI.OptimalTreeRegressor(
        random_seed=1,
        minbucket=20,
        regression_sparsity=:all,
    ),
    max_depth=1:2,
)
IAI.fit!(grid, X_train, y_train)
```

We can then take this trained regression tree and refit the leaf models with logistic regression instead of linear regression using `copy_splits_and_refit_leaves!`, again using `glmnet` to fit the logistic regression models:

```julia
old_lnr = IAI.get_learner(grid)
new_lnr = IAI.OptimalTreeClassifier(random_seed=1)
IAI.copy_splits_and_refit_leaves!(new_lnr, old_lnr, X_train, y_train,
    refit_learner=IAI.GLMNetCVClassifier(n_folds=2),
)
```

```julia
IAI.score(new_lnr, X_test, y_test)
```

```
1.0
```

This results in a classification tree with the same split structure as the regression tree, but with logistic regression models in each leaf where appropriate, and we can see the performance has improved.
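To sanity-check the refitted tree, we can also look at its predictions on the test set. This is a brief sketch using the standard IAI prediction interface:

```julia
# Predicted labels for the test set from the refitted tree
preds = IAI.predict(new_lnr, X_test)

# Predicted probabilities for each class (one column per label)
probs = IAI.predict_proba(new_lnr, X_test)
first(probs, 5)
```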

### 3. Refitting a classification tree during training

We can also overcome the limitation of the first approach by using the `refit_learner` parameter to incorporate the logistic regression refitting directly into the pruning process. This ensures that the pruning of the tree takes the logistic regression models in the leaves into account, making it much easier to find the right-sized tree.

In this case, we will again fit the logistic regression models with `glmnet`:

```julia
grid = IAI.GridSearch(
    IAI.OptimalTreeClassifier(
        random_seed=1,
        minbucket=20,
        refit_learner=IAI.GLMNetCVClassifier(),
    ),
    max_depth=1:2,
)
IAI.fit!(grid, X_train, y_train)
```

```julia
IAI.score(grid, X_test, y_test)
```

```
0.9333333333333333
```

We see that we now recover the correct tree structure and have good performance. However, the ground truth regression models use fewer variables than our logistic regressions. To address this, we can try using Optimal Feature Selection instead of `glmnet` to construct our logistic regression models. We will use a `GridSearch` that validates the sparsity of an `OptimalFeatureSelectionClassifier` as the `refit_learner`:

```julia
grid = IAI.GridSearch(
    IAI.OptimalTreeClassifier(
        random_seed=1,
        minbucket=20,
        refit_learner=IAI.GridSearch(
            IAI.OptimalFeatureSelectionClassifier(),
            sparsity=1:2,
        ),
    ),
    max_depth=1:2,
)
IAI.fit!(grid, X_train, y_train)
```

```julia
IAI.score(grid, X_test, y_test)
```

```
0.9
```

We achieve similar performance to before, but now the logistic regression models are much sparser.

In the example above, the `refit_learner` was a `GridSearch`, which means that the `sparsity` parameter will be validated separately in each leaf when fitting the logistic regression models. Another approach is to use the same sparsity in all leaves of the tree, and validate this shared `sparsity` parameter as part of the overall validation procedure. We can do this by including `refit_learner_sparsity` in the outermost `GridSearch` as a parameter to be validated:

```julia
grid = IAI.GridSearch(
    IAI.OptimalTreeClassifier(
        random_seed=1,
        minbucket=20,
        refit_learner=IAI.OptimalFeatureSelectionClassifier(),
    ),
    max_depth=1:2,
    refit_learner_sparsity=1:2,
)
IAI.fit!(grid, X_train, y_train)
```