# Quick Start Guide

In this example we will give an example of how to estimate and use rewards for evaluating a prescriptive policy. We will use synthetically-generated data so that we can compare the results of the reward estimation against a known truth.

## Synthetic Data Generation

For this demonstration, we will use the synthetic data generation process from Experiment 2 in Section 4 of Bertsimas and Dunn (2019). First, we will generate the features X with a mix of binary and numeric data:

using StableRNGs, DataFrames
rng = StableRNG(1)
n = 1000
p = 10

function generate_X(rng, n, p)
X = zeros(n, p)
for j = 1:p
X[:, j] = isodd(j) ? randn(rng, n) : rand(rng, 0:1, n)
end
DataFrame(X)
end
X = generate_X(rng, n, p)
1000×10 DataFrame. Omitted printing of 4 columns
│ Row  │ x1        │ x2      │ x3         │ x4      │ x5         │ x6      │
│      │ Float64   │ Float64 │ Float64    │ Float64 │ Float64    │ Float64 │
├──────┼───────────┼─────────┼────────────┼─────────┼────────────┼─────────┤
│ 1    │ -0.53252  │ 0.0     │ -0.0515071 │ 0.0     │ 0.322352   │ 1.0     │
│ 2    │ 0.0984655 │ 1.0     │ -0.986713  │ 0.0     │ -1.52388   │ 0.0     │
│ 3    │ 0.752887  │ 1.0     │ -0.247705  │ 1.0     │ -1.19824   │ 1.0     │
│ 4    │ -0.843537 │ 0.0     │ -0.691102  │ 0.0     │ -0.470327  │ 0.0     │
│ 5    │ -2.01383  │ 1.0     │ 0.0807973  │ 0.0     │ 0.442388   │ 0.0     │
│ 6    │ -0.269808 │ 1.0     │ 0.514877   │ 0.0     │ -0.572981  │ 0.0     │
│ 7    │ 1.03229   │ 0.0     │ -0.757771  │ 1.0     │ -0.0038146 │ 1.0     │
⋮
│ 993  │ -0.230778 │ 0.0     │ 0.969102   │ 0.0     │ 0.470769   │ 0.0     │
│ 994  │ 1.74012   │ 0.0     │ 1.85216    │ 1.0     │ -0.104094  │ 1.0     │
│ 995  │ 0.173307  │ 1.0     │ 1.59839    │ 0.0     │ -0.891135  │ 1.0     │
│ 996  │ -0.74202  │ 0.0     │ 1.10626    │ 0.0     │ -2.01397   │ 1.0     │
│ 997  │ 1.10232   │ 1.0     │ 1.66978    │ 0.0     │ -1.01751   │ 1.0     │
│ 998  │ 0.479083  │ 0.0     │ 0.163705   │ 0.0     │ 0.13763    │ 0.0     │
│ 999  │ -0.676729 │ 1.0     │ -0.975063  │ 1.0     │ 1.09703    │ 0.0     │
│ 1000 │ -1.38026  │ 0.0     │ -0.913913  │ 0.0     │ 0.573111   │ 0.0     │

We will consider data with two treatments ($A$ and $B$) where the treatment effect $y(B) - y(A)$ is given by

$4 \Iota\{x_1 > 1\} \Iota\{x_3 > 0\} + 4 \Iota\{x_5 > 1\} \Iota\{x_7 > 0\} + 2 x_8 x_9$

where $\Iota(x)$ is the indicator function. We can then generate the outcomes as follows:

function treatment_effect(x)
4 * (x > 1 ? 1 : 0) * (x > 0 ? 1 : 0) +
4 * (x > 1 ? 1 : 0) * (x > 0 ? 1 : 0) +
2 * x * x
end
function generate_outcomes(x)
baseline = 10
effect = treatment_effect(x)
yA = baseline - 0.5 * effect
yB = baseline + 0.5 * effect
yA, yB
end

Let's look at some examples:

[generate_outcomes(X[i, :]) for i = 1:3]
3-element Array{Tuple{Float64,Float64},1}:
(10.0, 10.0)
(9.344889115287753, 10.655110884712247)
(10.0, 10.0)

This gives us the ground truth for the data. We will assume that a lower outcome is more desirable, and so the optimal treatment is the one with the lowest outcome.

Now, we will generate some realistic training data. In real observational data, we usually see that the treatments are not assigned optimally nor entirely randomly. Instead, the treatments are somewhere inbetween: the treatment assignments are generally in the direction of optimality but with mistakes as well. To imitate this phenomenon with our synthetic data, we will assign the treatments randomly, but with the chance of receiving the optimal treatment depending on how far apart the outcomes for the two treatments are. In this way, the more obvious the choice of best treatment is in the ground truth, the more likely the training data will assign this treatment. The exact approach we use is:

$P(T = A) = \frac{e^{y(B) - y(A)}}{e^{y(B) - y(A)} + 1}$

In this way, the higher (worse) the impact of the treatment $B$, the lower the probability of being assigned $B$, and vice-versa.

The following code generates the treatment assignments and corresponding outcomes. We also add some small random noise to the outcomes.

function generate_treatment_and_outcome(rng, X)
n = size(X, 1)
treatments = fill("", n)
outcomes = zeros(n)
for i = 1:n
outcomeA, outcomeB = generate_outcomes(X[i, :])
effect = outcomeB - outcomeA
prob = exp(effect) / (exp(effect) + 1)
if rand(rng) < prob
treatments[i] = "A"
outcomes[i] = outcomeA
else
treatments[i] = "B"
outcomes[i] = outcomeB
end
outcomes[i] += 0.2 * randn(rng)
end
treatments, outcomes
end
treatments, outcomes = generate_treatment_and_outcome(rng, X)

We can look at some examples:

treatments[1:3], outcomes[1:3]
(["B", "B", "B"], [9.905067319437086, 10.848475480632686, 9.900543979881958])

## Learning a Prescriptive Policy

In order to get a prescriptive policy to evaluate, we will use a GridSearch to fit an OptimalTreePrescriptionMinimizer to this data:

grid = IAI.GridSearch(
IAI.OptimalTreePrescriptionMinimizer(random_seed=123),
max_depth=1:5,
)
IAI.fit!(grid, X, treatments, outcomes)
IAI.get_learner(grid)
Optimal Trees Visualization

We can see that the structure of this tree is aligned with the synthetic data generation process:

• It makes strongest predictions for treatment $A$ when $x_1 > 1$ and $x_3 > 0$ or when $x_5 > 1$ and $x_7 > 0$, as this is when indicator terms activate and the outcome for treatment $B$ is significantly higher.
• When neither indicator term is active, and $x_8$ is zero, both treatments are the same up to random noise, and so the prescriptions have roughly the same predicted outcome.
• When neither indicator term is active, and $x_8$ is one, the prescribed treatment follows the sign of $x_9$.

Based on this tree, and knowing the underlying generation process, we can be confident in the ability to recover the true mechanism behind the optimal treatment policy.

## Evaluation Against Ground Truth

Let's evaluate it using the synthetic ground truth. First we generate some new test data according to the ground truth:

n_test = 10000
test_X = generate_X(rng, n_test, p)
test_outcomes = zeros(n_test, 2)
for i = 1:n_test
test_outcomes[i, :] .= generate_outcomes(test_X[i, :])
end

Now we can use our model to make predictions on the new data and compare against the truth. We consider two measures, the accuracy of our treatment predictions against the true treatments, and the R-squared of our outcome predictions against the true best outcomes.

using Statistics
function evaluate(grid, X, outcomes)
best_treatment = map(i -> outcomes[i, 1] < outcomes[i, 2] ? "A" : "B",
1:size(X, 1))
best_outcome = map(i -> minimum(outcomes[i, :]), 1:size(X, 1))

treatment_predictions, outcome_predictions = IAI.predict(grid, X)
treatment_accuracy = mean(best_treatment .== treatment_predictions)
outcome_accuracy = 1 - sum(abs2, best_outcome .- outcome_predictions) /
sum(abs2, best_outcome .- mean(best_outcome))

Dict(
:treatment_accuracy => treatment_accuracy,
:outcome_accuracy => outcome_accuracy,
)
end

evaluate(grid, test_X, test_outcomes)
Dict{Symbol,Float64} with 2 entries:
:treatment_accuracy => 0.8851
:outcome_accuracy   => 0.707024

This says that our model identifies the correct treatment roughly 90% of the time, and the $R^2$ of the predicted outcomes is about 0.7.

## Evaluation Using Reward Estimation

In real applications, we don't know the true outcomes for the test data, and so we cannot simply find the best outcome for each test point and use that as the correct treatment. Instead, we can use reward estimation to estimate the quality of the policy on the test set.

First, let's generate some test treatments and outcomes in the same way we generated our training data to imitate a real scenario

test_T, test_y = generate_treatment_and_outcome(rng, test_X)

We will use a RewardEstimator to estimate the rewards. For this example, we will use random forests with the direct method to estimate rewards:

reward_lnr = IAI.RewardEstimator(
propensity_estimation_method=:random_forest,
outcome_estimation_method=:random_forest,
reward_estimation_method=:direct_method,
random_seed=123,
)
Unfitted RewardEstimator:
propensity_estimation_method: random_forest
outcome_estimation_method:    random_forest
reward_estimation_method:     direct_method
random_seed:                  123

We chose the direct method because we want to calculate treatment and outcome accuracy on a per-point basis. It is critical that we use the direct method if we want to evaluate the policy per-point, as the other methods can only be used to evaluate the policy in aggregate.

Now, we can simply use fit_predict! to estimate the rewards (note that we are passing in the test data rather the training data to ensure we get a fair out-of-sample evaluation):

rewards = IAI.fit_predict!(reward_lnr, test_X, test_T, test_y)
10000×2 DataFrame
│ Row   │ A       │ B       │
│       │ Float64 │ Float64 │
├───────┼─────────┼─────────┤
│ 1     │ 10.0009 │ 10.0098 │
│ 2     │ 10.011  │ 10.0338 │
│ 3     │ 10.4951 │ 9.4398  │
│ 4     │ 7.1242  │ 10.833  │
│ 5     │ 9.98348 │ 10.0396 │
│ 6     │ 9.98867 │ 9.9873  │
│ 7     │ 10.0219 │ 10.0817 │
⋮
│ 9993  │ 8.29683 │ 11.7985 │
│ 9994  │ 9.90001 │ 10.011  │
│ 9995  │ 9.99642 │ 10.0434 │
│ 9996  │ 9.49554 │ 10.6932 │
│ 9997  │ 10.1661 │ 9.77065 │
│ 9998  │ 10.0045 │ 10.0075 │
│ 9999  │ 11.7479 │ 7.83962 │
│ 10000 │ 7.93974 │ 10.6485 │

Now we can evaluate the quality of the prescriptive policy by using rewards for the treatments assigned to each point in place of the true outcomes:

evaluate(grid, test_X, rewards)
Dict{Symbol,Float64} with 2 entries:
:treatment_accuracy => 0.7407
:outcome_accuracy   => 0.71615

## Comparison Between Approaches

We know that reducing the depth of the tree should also reduce the quality of the model, so let's check that it makes sense. We can also check if the evaluation using estimated rewards shows the same trends as the ground truth approach.

df = DataFrame()
for depth = 1:5
lnr = IAI.GridSearch(
IAI.OptimalTreePrescriptionMinimizer(
max_depth=depth,
random_seed=123,
),
)
IAI.fit!(lnr, X, treatments, outcomes)

true_results = evaluate(lnr, test_X, test_outcomes)
reward_results = evaluate(lnr, test_X, rewards)

append!(df, DataFrame(
depth=depth,
t_acc_truth=true_results[:treatment_accuracy],
t_acc_reward=reward_results[:treatment_accuracy],
o_acc_truth=true_results[:outcome_accuracy],
o_acc_reward=reward_results[:outcome_accuracy],
))
end
df
5×5 DataFrame
│ Row │ depth │ t_acc_truth │ t_acc_reward │ o_acc_truth │ o_acc_reward │
│     │ Int64 │ Float64     │ Float64      │ Float64     │ Float64      │
├─────┼───────┼─────────────┼──────────────┼─────────────┼──────────────┤
│ 1   │ 1     │ 0.3648      │ 0.593        │ 0.117393    │ 0.126124     │
│ 2   │ 2     │ 0.7167      │ 0.7381       │ 0.0708337   │ 0.0817437    │
│ 3   │ 3     │ 0.8576      │ 0.7166       │ 0.465927    │ 0.474563     │
│ 4   │ 4     │ 0.9114      │ 0.743        │ 0.663972    │ 0.669055     │
│ 5   │ 5     │ 0.8851      │ 0.7407       │ 0.707024    │ 0.71615      │

We can see that the results from the reward estimation approach largely follow the same trends as the results for the ground truth results, despite there being some differences in the raw numbers. This gives us confidence that we can evaluate, rank and select models using reward estimation in real-world settings where the ground truth approach is not possible.