What is Reward Estimation?

One of the biggest issues faced when solving prescriptive problems is the question of how to evaluate the quality of a proposed prescription policy. This issue is most pronounced when dealing with so-called observational data, where we simply have the following collection of historical data:

  • $\mathbf{x}_i$ is the vector of observed covariates for the $i$th point
  • $T_i$ is the observed treatment(s) for the $i$th point
  • $y_i$ is the observed outcome for the $i$th point (corresponding to treatment $T_i$), which can be either a numeric or binary outcome

Our goal is to develop a prescriptive policy that assigns treatments $T$ to new observations based on their features $\mathbf{x}$ with the goal of minimizing or maximizing the resulting outcome $y$. If $y$ is a binary outcome, this corresponds to minimizing or maximizing the probability of the outcome occurring, whereas if $y$ is numeric, the policy will aim to minimize or maximize the numeric outcome.

The key difficulty with this approach is that a new policy will almost certainly prescribe treatments that are different to those observed in the data, and so we do not know what would have happened under the new treatments (these unknown outcomes are often called counterfactuals). Reward estimation is an approach that seeks to address this problem, by estimating the reward, denoted $\Gamma_i(t)$, that should be attributed to a policy for assigning a given treatment $t$ to a given observation $i$.

There are two main uses for reward estimation:

  1. Evaluating prescriptive policies: By estimating the reward for a given point under a given treatment, we are able to fairly evaluate the quality of a prescriptive policy, despite only observing outcomes under a single treatment.
  2. Estimating outcomes for training: Methods such as Optimal Policy Trees are trained using a rewards matrix containing the outcome for each point under each treatment. If we want to use these methods to learn from observational data, we will need to estimate the counterfactuals in order to complete the matrix.
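To make the second use concrete, the following minimal sketch builds a rewards matrix from observational data, leaving the counterfactual entries empty. The treatment labels, outcomes, and variable names are hypothetical and chosen only for illustration; the empty entries are what reward estimation would need to fill in.

```python
# Hypothetical observational data: one observed treatment and outcome per point.
treatments = ["A", "B", "C"]
observed_T = ["A", "C", "B"]
observed_y = [1.2, 0.4, 2.0]

# One row per point, one entry per treatment. Only the entry for the observed
# treatment is known; the counterfactual entries are left as None.
rewards_matrix = [
    {t: (y if t == T else None) for t in treatments}
    for T, y in zip(observed_T, observed_y)
]

# Count the counterfactual entries that reward estimation must fill in.
missing = sum(value is None for row in rewards_matrix for value in row.values())
```

With three points and three treatments, only three of the nine entries are observed, so two thirds of the matrix must be estimated.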

Methods for Reward Estimation

The mechanism used to estimate rewards depends on the type of treatment being applied. We divide treatments into two categories, each with its own process for conducting reward estimation:

  1. Categorical treatments have two or more treatment options that are mutually exclusive and unrelated (for example, pick one treatment from A, B, or C).

    If there are multiple distinct treatment choices to make, we can treat each combination of treatments as its own treatment (for example, if we have to choose between A and B for the first treatment, and Y and Z for the second, we have four options: AY, AZ, BY and BZ).

    For categorical treatments, the categorical reward estimation process should be followed.

  2. Numeric treatments are treatments with numeric-valued doses (for example, the amount of treatment A to be given).

    If there are multiple distinct treatment choices to make, we combine the numeric doses into a vector of doses (for example, if we have a dose of 5 for treatment A and 3 for treatment B, the treatment would be [5, 3]).

    For numeric treatments, the numeric reward estimation process should be followed.
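The two ways of combining multiple treatment choices can be sketched as follows. This is an illustration of the data layout only, using the example values from the text; the variable names are hypothetical and do not correspond to any library API.

```python
from itertools import product

# Categorical: two separate choices, taken from the example in the text.
first_choice = ["A", "B"]
second_choice = ["Y", "Z"]

# Each combination of choices becomes its own treatment label.
combined_treatments = ["".join(combo) for combo in product(first_choice, second_choice)]

# Numeric: the dose of each treatment is collected into a single vector.
doses = {"A": 5, "B": 3}
dose_vector = [doses[name] for name in ("A", "B")]
```

Note that the number of combined categorical treatments grows multiplicatively with the number of choices, so each combination may be observed only rarely in the data.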

Evaluating Prescriptive Policies Using Rewards

After conducting reward estimation, we can use the rewards for policy evaluation. Given a policy $\pi(\mathbf{x})$ that assigns treatments to points, there are multiple ways to evaluate its quality.


Note that for observational data, the treatment assignments observed in the data often constitute a baseline policy. We can evaluate the quality of this policy to gain a standard of performance against which we can compare alternative policies.

Aggregate Policy Reward

The simplest way to evaluate the quality of a policy is by summing the rewards of the assigned treatments for each point:

\[\sum_i \Gamma_i(\pi(\mathbf{x}_i))\]
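As a minimal sketch of this sum, assume the estimated rewards are stored as one mapping per point from treatment to reward, so that `rewards[i][t]` plays the role of $\Gamma_i(t)$. The data, the policy, and the function name below are hypothetical.

```python
def aggregate_policy_reward(rewards, policy, features):
    """Sum the estimated reward of the treatment the policy assigns to each point."""
    return sum(rewards[i][policy(x)] for i, x in enumerate(features))


# Hypothetical estimated rewards for three points and two treatments.
rewards = [
    {"A": 1.0, "B": 3.0},
    {"A": 2.5, "B": 0.5},
    {"A": 0.0, "B": 4.0},
]
features = [[0.2], [0.9], [0.4]]


def policy(x):
    # Hypothetical policy: prescribe A when the first feature exceeds 0.5.
    return "A" if x[0] > 0.5 else "B"


total = aggregate_policy_reward(rewards, policy, features)
```

Here the policy prescribes B, A, and B in turn, giving a total of 3.0 + 2.5 + 4.0 = 9.5. The same function applied to the observed treatments would give the aggregate reward of the baseline policy.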

Comparison to Oracle

We can also evaluate the performance of the policy by comparing to an oracle that picks the treatment with the best reward for each point. You might consider the following metrics:

  • the proportion of points where the policy prescription matches the oracle prescription
  • an assessment of the difference in rewards under the policy prescription and the oracle prescription, e.g. mean-squared error or mean-absolute error of the difference

For categorical treatment problems, you can only compare performance to an oracle if you estimated the rewards using the Direct Method. Rewards from the other estimation methods are only meaningful when comparing aggregate policy values.
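The oracle metrics above can be sketched as follows, assuming the rewards are comparable point by point (for categorical treatments, estimated with the Direct Method) and stored as one mapping per point from treatment to reward. The function name, the `maximize` flag, and the data are assumptions made for illustration.

```python
def compare_to_oracle(rewards, prescriptions, maximize=True):
    """Compare policy prescriptions against the per-point best treatment."""
    best = max if maximize else min
    matches = 0
    gaps = []
    for row, treatment in zip(rewards, prescriptions):
        # The oracle picks the treatment with the best estimated reward.
        oracle_treatment = best(row, key=row.get)
        matches += treatment == oracle_treatment
        gaps.append(row[oracle_treatment] - row[treatment])
    n = len(gaps)
    return {
        "match_rate": matches / n,
        "mean_abs_error": sum(abs(g) for g in gaps) / n,
        "mean_sq_error": sum(g * g for g in gaps) / n,
    }


# Hypothetical rewards and policy prescriptions for four points.
rewards = [
    {"A": 1.0, "B": 3.0},
    {"A": 2.5, "B": 0.5},
    {"A": 0.0, "B": 4.0},
    {"A": 1.0, "B": 1.5},
]
prescriptions = ["B", "B", "B", "A"]

metrics = compare_to_oracle(rewards, prescriptions)
```

In this example the policy agrees with the oracle on the first and third points, so the match rate is 0.5, and the reward gaps on the other two points are 2.0 and 0.5.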

Conducting Reward Estimation Correctly

Evaluating prescriptive policies is more complicated than evaluating predictive models. In particular, we recommend the following steps to ensure that you get fair and useful reward estimates.

Ensure Sufficient Test Data


Make sure you include enough data in the test set to train high-quality reward estimation models.

Unlike a normal predictive model evaluation, reward estimation requires training models on the test set as part of the evaluation. If there is not enough data in the test set, the quality of the reward estimation may suffer, leading to an evaluation of the prescriptive policy that may be inaccurate and unreasonable.

For a prescriptive problem, we recommend using much more data for the test set than you would for a typical predictive problem. Allocating 50% of the data for testing is a good starting point.
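A 50% split can be sketched with a simple shuffled index split. The function below is a hypothetical helper written for illustration, not part of any library; in practice you would use whatever data-splitting utility your workflow already provides.

```python
import random


def split_train_test(n_points, test_fraction=0.5, seed=1):
    """Return (train_indices, test_indices) from a seeded random shuffle."""
    indices = list(range(n_points))
    random.Random(seed).shuffle(indices)
    n_test = round(n_points * test_fraction)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_train_test(1000)
```

The reward estimation models used for evaluation would then be trained only on the points in `test_idx`.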

Proper Estimation of Rewards


When estimating rewards for training Optimal Policy Trees, you should estimate the rewards using fit_predict! on the training data, and make sure in-sample cross-validation is used for estimation.

When using reward estimation for evaluating prescription policies, you should estimate the rewards using fit_predict! on the test data, and make sure in-sample cross-validation is used for estimation.

In order to simulate a fair out-of-sample evaluation, we need to prevent any information leaking from the training data into the estimated rewards. This ensures that the rewards used for evaluation are purely a function of the testing data.

We also need to make sure that reward estimates used for both training and testing are not in-sample predictions, as the reward estimator may overfit to the data, resulting in overly optimistic reward estimates.

For these reasons, it is important that fit_predict! is used to fit the model and estimate the rewards in a single step, rather than using fit! followed by predict. This causes the reward estimator to use an internal in-sample cross-validation process when estimating the rewards, meaning that the reward for each point is never estimated using a model that saw this point in training, mitigating the risk of overfitting. The number of folds used during this cross-validation process is controlled by the learner parameters :propensity_estimation_insample_num_folds (categorical treatments only) and :outcome_estimation_insample_num_folds.
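The idea behind in-sample cross-validation can be illustrated with a generic sketch. This is not the library's implementation: the fold assignment, the function names, and the toy mean estimator are all assumptions chosen to show that each point's estimate comes from a model that never saw that point.

```python
def cross_fitted_predictions(xs, ys, num_folds, fit, predict):
    """Predict each point using a model trained only on the other folds."""
    n = len(ys)
    # Assign points to folds in round-robin order (an illustrative choice).
    folds = [list(range(k, n, num_folds)) for k in range(num_folds)]
    predictions = [None] * n
    for held_out in folds:
        held_out_set = set(held_out)
        train = [i for i in range(n) if i not in held_out_set]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in held_out:
            predictions[i] = predict(model, xs[i])
    return predictions


def fit_mean(xs, ys):
    # Toy estimator: the model is simply the mean training outcome.
    return sum(ys) / len(ys)


def predict_mean(model, x):
    return model


xs = [[0.0], [1.0], [2.0], [3.0]]
ys = [1.0, 2.0, 3.0, 4.0]
estimates = cross_fitted_predictions(xs, ys, 2, fit_mean, predict_mean)
```

With two folds, the first and third points are predicted from the mean of the second and fourth outcomes (3.0), and vice versa (2.0). An in-sample fit would instead predict 2.5 for every point, using each point's own outcome in its estimate.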

Evaluating Quality of Reward Estimation


Use the score(s) returned by fit_predict! to gauge the quality of the reward estimation process and to compare the relative quality of different internal estimators.

The quality of the policy evaluation or model training based on estimated rewards is only as good as the reward estimation process. When using fit_predict!, the score(s) of the internal estimators are provided as the second return value. These scores should be used to judge whether the reward estimation process is reliable enough to serve as the basis for training or evaluation. If the scores are too low, you can try conducting more parameter validation in the internal estimators, or try a different class of estimation method altogether.