# Quick Start Guide: Optimal Prescriptive Trees

In this example we will give an example of how to use Optimal Prescriptive Trees (OPT). We will use synthetically-generated data to better understand the workings of prescriptive trees and how we can fairly evaluate performance out-of-sample in a prescriptive problem.

## Data for prescriptive problems

Prescriptive trees are trained on observational data, and require three distinct types of data:

`X`

: the features for each observation that can be used as the splits in the tree - this can be a matrix or a dataframe as for classification or regression problems`treatments`

: the treatment applied to each observation - this is a vector of the treatment labels similar to the target in a classification problem`outcomes`

: the outcome for each observation under the applied treatment - this is a vector of numeric values similar to the target in a regression problem

Refer to the documentation on data preparation for more information on the data format.

First, we will generate the features `X`

with a mix of binary and numeric data:

```
using Random, DataFrames
Random.seed!(123)
n = 1000
p = 10
function generate_X(n, p)
X = zeros(n, p)
for j = 1:p
X[:, j] = isodd(j) ? randn(n) : rand(0:1, n)
end
DataFrame(X)
end
X = generate_X(n, p)
```

```
1000×10 DataFrames.DataFrame. Omitted printing of 4 columns
│ Row │ x1 │ x2 │ x3 │ x4 │ x5 │ x6 │
│ │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │
├──────┼─────────────┼─────────┼───────────┼─────────┼────────────┼─────────┤
│ 1 │ 1.19027 │ 0.0 │ -0.387982 │ 1.0 │ -0.797032 │ 0.0 │
│ 2 │ 2.04818 │ 0.0 │ 0.140906 │ 1.0 │ -1.51078 │ 0.0 │
│ 3 │ 1.14265 │ 1.0 │ 0.0715651 │ 1.0 │ 1.51331 │ 0.0 │
│ 4 │ 0.459416 │ 1.0 │ -0.956586 │ 1.0 │ -0.470199 │ 0.0 │
│ 5 │ -0.396679 │ 1.0 │ 0.990591 │ 0.0 │ 1.54503 │ 0.0 │
│ 6 │ -0.664713 │ 0.0 │ 0.694683 │ 1.0 │ -1.58767 │ 0.0 │
│ 7 │ 0.980968 │ 1.0 │ -1.1144 │ 1.0 │ -1.73695 │ 1.0 │
⋮
│ 993 │ -0.0864289 │ 1.0 │ 0.195309 │ 1.0 │ 2.17761 │ 0.0 │
│ 994 │ 1.21069 │ 1.0 │ 0.980612 │ 0.0 │ -1.62596 │ 0.0 │
│ 995 │ 0.241134 │ 0.0 │ 0.833495 │ 0.0 │ -0.0331607 │ 0.0 │
│ 996 │ -1.6429 │ 0.0 │ 1.7444 │ 1.0 │ -0.0457842 │ 0.0 │
│ 997 │ -0.00522395 │ 0.0 │ -0.96827 │ 1.0 │ -0.60585 │ 1.0 │
│ 998 │ -1.23102 │ 0.0 │ -0.918752 │ 0.0 │ -0.0344973 │ 0.0 │
│ 999 │ -1.76795 │ 0.0 │ 0.519781 │ 0.0 │ -0.391488 │ 1.0 │
│ 1000 │ 0.318717 │ 1.0 │ -0.730775 │ 0.0 │ -0.603678 │ 0.0 │
```

We will consider data with two treatments ($A$ and $B$) where the treatment effect $y(B) - y(A)$ is given by

where $\Iota(x)$ is the indicator function. We can then generate the outcomes as follows:

```
function treatment_effect(x)
4 * (x[1] > 1 ? 1 : 0) * (x[3] > 0 ? 1 : 0) +
4 * (x[5] > 1 ? 1 : 0) * (x[7] > 0 ? 1 : 0) +
2 * x[8] * x[9]
end
function generate_outcomes(x)
baseline = 10
effect = treatment_effect(x)
yA = baseline - 0.5 * effect
yB = baseline + 0.5 * effect
yA, yB
end
```

Let's look at some examples:

`[generate_outcomes(X[i, :]) for i = 1:3]`

```
3-element Array{Tuple{Float64,Float64},1}:
(10.0, 10.0)
(8.0, 12.0)
(5.6214936204838, 14.3785063795162)
```

This gives us the ground truth for the data. We will assume that a lower outcome is more desirable, and so the optimal treatment is the one with the lowest outcome.

Now, we will generate some realistic training data. In real observational data, we usually see that the treatments are not assigned optimally nor entirely randomly. Instead, the treatments are somewhere inbetween: the treatment assignments are generally in the direction of optimality but with mistakes as well. To imitate this phenomenon with our synthetic data, we will assign the treatments randomly, but with the chance of receiving the optimal treatment depending on how far apart the outcomes for the two treatments are. In this way, the more obvious the choice of best treatment is in the ground truth, the more likely the training data will assign this treatment. The exact approach we use is:

In this way, the higher (worse) the impact of the treatment $B$, the lower the probability of being assigned $B$, and vice-versa.

The following code generates the treatment assignments and corresponding outcomes. We also add some small random noise to the outcomes.

```
function generate_treatment_and_outcome(X)
n = size(X, 1)
treatments = fill("", n)
outcomes = zeros(n)
for i = 1:n
outcomeA, outcomeB = generate_outcomes(X[i, :])
effect = outcomeB - outcomeA
prob = exp(effect) / (exp(effect) + 1)
if rand() < prob
treatments[i] = "A"
outcomes[i] = outcomeA
else
treatments[i] = "B"
outcomes[i] = outcomeB
end
outcomes[i] += 0.2 * randn()
end
treatments, outcomes
end
treatments, outcomes = generate_treatment_and_outcome(X)
```

We can look at some examples:

`treatments[1:3], outcomes[1:3]`

`(["A", "A", "A"], [10.0765, 7.99791, 5.68372])`

## Training Optimal Prescriptive Trees

We will use a `GridSearch`

to fit an `OptimalTreePrescriptionMinimizer`

:

```
grid = IAI.GridSearch(
IAI.OptimalTreePrescriptionMinimizer(),
max_depth=1:5,
)
IAI.fit!(grid, X, treatments, outcomes)
IAI.get_learner(grid)
```

We can see that the structure of this tree is aligned with the synthetic data generation process:

- It makes strongest predictions for treatment $A$ when $x_1 > 1$ and $x_3 > 0$ or when $x_5 > 1$ and $x_7 > 0$, as this is when indicator terms activate and the outcome for treatment $B$ is significantly higher.
- When neither indicator term is active, and $x_8$ is zero, both treatments are the same up to random noise, and so the prescriptions have roughly the same predicted outcome.
- When neither indicator term is active, and $x_8$ is one, the prescribed treatment follows the sign of $x_9$.

Based on this tree, and knowing the underlying generation process, we can be confident in the ability to recover the true mechanism behind the optimal treatment policy.

## Evaluation against synthetic data

Let's evaluate it using the synthetic ground truth. First we generate some new data according to the ground truth:

```
n_test = 10000
test_X = generate_X(n_test, p)
test_outcomes = zeros(n_test, 2)
for i = 1:n_test
test_outcomes[i, :] .= generate_outcomes(test_X[i, :])
end
```

Now we can use our model to make predictions on the new data and compare against the truth. We consider two measures, the accuracy of our treatment predictions against the true treatments, and the R-squared of our outcome predictions against the true best outcomes.

```
using Statistics
function evaluate_synthetic(grid, X, outcomes)
best_treatment = map(i -> outcomes[i, 1] < outcomes[i, 2] ? "A" : "B",
1:size(X, 1))
best_outcome = map(i -> minimum(outcomes[i, :]), 1:size(X, 1))
treatment_predictions, outcome_predictions = IAI.predict(grid, X)
treatment_accuracy = mean(best_treatment .== treatment_predictions)
outcome_accuracy = 1 - sum(abs2, best_outcome .- outcome_predictions) /
sum(abs2, best_outcome .- mean(best_outcome))
Dict(
:treatment_accuracy => treatment_accuracy,
:outcome_accuracy => outcome_accuracy,
)
end
evaluate_synthetic(grid, test_X, test_outcomes)
```

```
Dict{Symbol,Float64} with 2 entries:
:treatment_accuracy => 0.9375
:outcome_accuracy => 0.746511
```

This says that our model identifies the correct treatment roughly 90% of the time, and the $R^2$ of the predicted outcomes is about 0.7.

We can try validating over `prescription_factor`

to see if that improves the quality of the results. Note that when we validate over `prescription_factor`

we need to specify a validation criterion - here we choose `:prediction_accuracy`

to validate based on the quality of the predictions made by the tree:

```
grid = IAI.GridSearch(
IAI.OptimalTreePrescriptionMinimizer(
max_depth=5,
),
prescription_factor=0.0:0.2:1.0,
)
IAI.fit!(grid, X, treatments, outcomes,
validation_criterion=:prediction_accuracy)
IAI.get_learner(grid)
```