What is Reward Estimation?

One of the biggest issues faced when solving prescriptive problems is how to evaluate the quality of a proposed prescription policy. This issue is most pronounced when dealing with so-called observational data, where all we have is a collection of historical data of the following form:

$\mathbf{x}_i$ is the vector of observed covariates for the $i$th point
$T_i$ is the observed treatment(s) for the $i$th point
$y_i$ is the observed outcome for the $i$th point (corresponding to treatment $T_i$), which can be a numeric, binary, or survival outcome

Our goal is to develop a prescriptive policy that assigns treatments $T$ to new observations based on their features $\mathbf{x}$ with the goal of minimizing or maximizing the resulting outcome $y$:

if $y$ is a binary outcome, this corresponds to minimizing or maximizing the probability of the outcome occurring
if $y$ is numeric, the policy will aim to minimize or maximize the numeric outcome
if $y$ is a survival outcome, the policy will minimize or maximize either the expected survival time or the survival probability at a given time

The key difficulty in this setting is that a new policy will almost certainly prescribe treatments that differ from those observed in the data, so we do not know what would have happened under the new treatments (these unknown outcomes are often called counterfactuals). This makes it difficult to optimize prescription policies, compare them, and assess their quality.

Reward estimation is an approach from the causal inference literature that seeks to address this problem, by utilizing the observed data to make inferences about the unobserved outcomes. Concretely, the goal of reward estimation is to estimate rewards, denoted $\Gamma_i(t)$, that should be attributed to a policy for assigning a given treatment $t$ to a given observation $i$, such that the sum of such rewards across all points reflects the relative quality of the prescriptive policy. Specifically, if we have a prescription policy $\pi(\mathbf{x})$ that assigns treatments to points based on their features $\mathbf{x}$, the quality of this policy is given by

\[\sum_i \Gamma_i(\pi(\mathbf{x}_i))\]
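As a minimal sketch of how this sum could be computed, assume the estimated rewards are already available as a matrix with one row per observation and one column per treatment. The names `rewards`, `X`, `policy`, and `policy_value` are hypothetical and chosen only for this illustration, and the numbers are made up; this is not the API of any particular library.

```python
import numpy as np

# Hypothetical estimated rewards: rows are observations i, columns are treatments t.
# rewards[i, t] plays the role of Gamma_i(t); the values are invented for illustration.
rewards = np.array([
    [1.0, 3.0, 2.0],
    [4.0, 0.5, 1.5],
    [2.0, 2.5, 6.0],
])

# Hypothetical covariates, one row per observation.
X = np.array([
    [0.2],
    [0.9],
    [0.5],
])


def policy(x):
    """Toy policy pi(x): pick a treatment index from the first feature."""
    if x[0] < 0.4:
        return 1
    if x[0] < 0.8:
        return 2
    return 0


def policy_value(rewards, X, policy):
    """Sum of Gamma_i(pi(x_i)) over all observations."""
    prescribed = np.array([policy(x) for x in X])
    # Pick, for each row i, the reward in the column the policy prescribed.
    return rewards[np.arange(len(X)), prescribed].sum()


total = policy_value(rewards, X, policy)
print(total)
```

Two policies can then be compared on the same data by computing this total for each, with the direction of comparison depending on whether the outcome is being minimized or maximized.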

There are two main uses for reward estimation:

Evaluating prescriptive policies: By estimating the reward for a given point under a given treatment, we are able to fairly evaluate the quality of a prescriptive policy, despite only observing outcomes under a single treatment.
Estimating outcomes for training: Methods such as Optimal Policy Trees are trained using a rewards matrix containing the outcome for each point under each treatment. If we want to use these methods to learn from observational data, we will need to estimate the counterfactuals in order to complete the matrix.

Types of Reward Estimation

The specific flavor of reward estimation that we conduct depends on the types of both treatments and outcomes in the problem. The first step when conducting reward estimation is to examine the problem and identify these characteristics.

Different Treatment Types

We divide treatments into two categories, each with its own process for conducting reward estimation:

Categorical treatments have two or more treatment options that are mutually exclusive and unrelated (for example, pick one treatment from A, B, or C).
If there are multiple distinct treatment choices to make, we can treat each combination of treatments as its own treatment (for example, if we have to choose between A and B for the first treatment, and Y and Z for the second, we have four options: AY, AZ, BY and BZ).
Numeric treatments are treatments with numeric-valued doses (for example, the amount of treatment A to be given).
If there are multiple distinct treatment choices to make, we combine the numeric doses into a vector of doses (for example, if we have a dose of 5 for treatment A and 3 for treatment B, the treatment would be [5, 3]).
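The two ways of handling multiple treatment choices described above can be sketched as follows. This is an illustration only: the variable names are hypothetical, and the representation a given reward estimation tool expects may differ.

```python
from itertools import product

import numpy as np

# Categorical case: each combination of choices becomes its own treatment option.
first_choice = ["A", "B"]
second_choice = ["Y", "Z"]
combined = ["".join(combo) for combo in product(first_choice, second_choice)]
print(combined)  # ['AY', 'AZ', 'BY', 'BZ']

# Numeric case: the doses of the individual treatments are stacked into one vector.
doses = {"A": 5.0, "B": 3.0}  # hypothetical doses for treatments A and B
dose_vector = np.array([doses["A"], doses["B"]])
print(dose_vector)
```

Note that the number of combined categorical options grows multiplicatively with the number of choices, so each additional choice reduces the amount of observed data available per option.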

Different Outcome Types

Reward estimation can be conducted for various types of outcomes:

Numeric outcomes involve minimizing or maximizing a single numeric value (e.g. sales profit, lifespan), and are estimated using regression models
Binary outcomes involve minimizing or maximizing the probability of a discrete event occurring (e.g. maximizing the chance of a successful sale), and are estimated using classification models
Survival outcomes involve minimizing or maximizing either the time until a discrete event occurs (e.g. maximizing expected survival time of patient) or the survival probability at a given time (e.g. maximizing 10-year survival probability), and are estimated using survival models. The choice of target is controlled via evaluation_time.

Conducting Reward Estimation

After identifying the type of treatment and outcome in the problem, we can proceed with the reward estimation. The process we follow depends primarily on the treatment type: