What is Reward Estimation?

One of the biggest issues faced when solving prescriptive problems is the question of how to evaluate the quality of a proposed prescription policy. The key issue is that a new policy will almost certainly prescribe treatments that are different to those observed in the data, and so we do not know what would have happened under the new treatments. Reward estimation is an approach that seeks to address this problem, by estimating the reward that should be attributed to a policy for assigning a given treatment to a given observation.

Throughout this guide we use the following notation:

  • $\mathbf{x}_i$ is the vector of observed covariates for the $i$th point
  • $T_i$ is the observed treatment for the $i$th point
  • $y_i$ is the observed outcome for the $i$th point (corresponding to treatment $T_i$)
  • $p_i(t)$ is the predicted probability that the $i$th point was assigned treatment $t$ in the original data, i.e., the propensity score
  • $f_i(t)$ is the predicted outcome for the $i$th point under treatment $t$
  • $\Gamma_i(t)$ is the estimated reward realized if the $i$th point is assigned treatment $t$

Reward Estimation Process

The reward estimation process consists of three steps.

1. Estimating Propensity Scores

The propensity score $p_i(t)$ is the probability that a given point $i$ would be assigned treatment $t$ based on its features $\mathbf{x}_i$.

In a randomized experiment where treatments are assigned to points uniformly at random, the propensity scores would be equal for all treatments and would not depend on the features.

In observational data, there is typically some non-random treatment assignment process that assigns treatments based on the observed features. In these cases, we can estimate the propensity scores by treating the problem as a multi-class classification problem, where we try to predict the assigned treatments using the features. This trained model can then estimate the probabilities of treatment assignment for each point.
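As a rough illustration of the idea, the sketch below swaps the multi-class classifier for the simplest possible stand-in: a frequency table over a single discrete feature. The function name `estimate_propensities` and the toy data are hypothetical and are not part of any library API; a real application would use a proper classification model.

```python
from collections import Counter, defaultdict


def estimate_propensities(strata, treatments):
    """Estimate p_i(t) as the share of points in the same stratum that received t.

    This is a frequency-table stand-in for a multi-class classifier and
    assumes a single discrete feature (the stratum) per point.
    """
    treatment_levels = sorted(set(treatments))

    # Count how often each treatment was assigned within each stratum.
    counts = defaultdict(Counter)
    for stratum, treatment in zip(strata, treatments):
        counts[stratum][treatment] += 1

    # Convert the counts into assignment probabilities for every point.
    propensities = []
    for stratum in strata:
        total = sum(counts[stratum].values())
        propensities.append(
            {t: counts[stratum][t] / total for t in treatment_levels}
        )
    return propensities


# Toy data: one discrete feature and two treatments (hypothetical values).
strata = ["young", "young", "young", "old", "old", "old", "old"]
treatments = ["A", "A", "B", "B", "B", "B", "A"]

p = estimate_propensities(strata, treatments)
print(p[0])  # propensities for the first "young" point
print(p[3])  # propensities for the first "old" point
```

Each point receives one probability per treatment, and the probabilities for a point sum to one.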

2. Estimating Outcomes

The predicted outcome $f_i(t)$ is the outcome that we would predict to occur if point $i$ with features $\mathbf{x}_i$ were assigned treatment $t$. These are often referred to as counterfactuals or potential outcomes.

We can estimate the outcomes by training a separate regression model for each treatment to predict the outcomes under that treatment. For a given treatment, we train the model on the subset of the data that received this treatment. We can then apply each of the trained regression models in order to predict the outcomes under each treatment for each point.
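The per-treatment modeling described above might be sketched as follows. This is a minimal example that assumes a single numeric covariate and uses ordinary least squares on a line as the regression model; the helper names `fit_line` and `estimate_outcomes` are hypothetical, and any regression method could take their place.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares.

    Assumes xs contains at least two distinct values.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def estimate_outcomes(xs, treatments, ys):
    """Return f_i(t) for every point i and every treatment t."""
    # Train one model per treatment, using only the points that received it.
    models = {}
    for t in sorted(set(treatments)):
        subset = [(x, y) for x, tt, y in zip(xs, treatments, ys) if tt == t]
        models[t] = fit_line([x for x, _ in subset], [y for _, y in subset])

    # Apply every model to every point, including points that received
    # a different treatment in the data.
    return [
        {t: slope * x + intercept for t, (slope, intercept) in models.items()}
        for x in xs
    ]


# Toy data (hypothetical): outcomes rise with x under A and are flat under B.
xs = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
treatments = ["A", "A", "A", "B", "B", "B"]
ys = [2.0, 4.0, 6.0, 5.0, 5.0, 5.0]

f = estimate_outcomes(xs, treatments, ys)
print(f[0])  # predicted outcomes for the first point under A and B
```

Note that each point gets a prediction under every treatment, even though only one treatment was observed for it.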

3. Estimating Rewards

Given estimates of the propensity scores and outcomes, there are different approaches for estimating the reward associated with assigning a given observation a given treatment. For more information on these approaches, you can refer to papers in the causal inference literature such as Dudik et al. (2011).


The inverse propensity weighting and doubly robust estimators divide by the predicted propensity as part of calculating the reward, meaning that small propensity predictions can lead to very large or infinite reward predictions. If this occurs, you should use the propensity_min_value parameter to control the minimum propensity value that can be predicted.
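To see why a minimum propensity helps, consider the sketch below. It assumes the minimum is applied by clipping predicted propensities from below, which is one plausible reading of the parameter and not a description of the library's internals; `clip_propensity` is a hypothetical helper.

```python
def clip_propensity(propensity, min_value):
    """Raise any predicted propensity below min_value up to min_value.

    Assumption: the minimum is enforced by simple clipping from below.
    """
    return max(propensity, min_value)


observed_outcome = 10.0
predicted_propensity = 0.001  # a very small propensity prediction

# Dividing by the raw propensity inflates the reward by a factor of 1000.
raw_reward = observed_outcome / predicted_propensity

# With a minimum of 0.05, the inflation factor is capped at 20.
clipped_reward = observed_outcome / clip_propensity(predicted_propensity, 0.05)

print(raw_reward, clipped_reward)
```

A larger minimum value limits the size of the rewards at the cost of some bias in the estimates.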

Direct Method

Argument name: :direct_method

The direct method estimator simply uses the predicted outcome under each treatment as the estimated reward:

\[\Gamma_i(t) = f_i(t)\]

The quality of the reward estimation is thus determined solely by the quality of the outcome prediction model. One potential issue with this approach is that the outcome prediction model is trained holistically, rather than being trained with prescription in mind. This means the model might focus attention on predicting outcomes in areas that are irrelevant for comparing the relative quality of treatments.

Inverse Propensity Weighting

Argument name: :inverse_propensity_weighting

The inverse propensity weighting estimator predicts the rewards by adjusting the observed outcomes using estimates of the propensity score:

\[\Gamma_i(t) = \begin{cases} \frac{y_i}{p_i(t)}, & \text{if } t = T_i\\ 0, & \text{otherwise} \end{cases}\]

This can be interpreted as taking the observed outcomes and correcting for any shift in treatment assignment probabilities between the observed data and the policy being evaluated. The quality of this method is tied to the probability estimation model. In particular, small probability estimates can cause large reward estimates. It may also require a larger number of observations with good coverage of the possible treatments to ensure that the probability estimates are non-zero in most places.
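The formula above translates directly into code. The following is a small sketch for a single point, using the notation of this guide; the function name `ipw_reward` and the numbers are illustrative only.

```python
def ipw_reward(t, observed_treatment, y, propensity):
    """Inverse propensity weighting reward Gamma_i(t) for a single point.

    propensity maps each treatment to its predicted probability p_i(t).
    """
    if t == observed_treatment:
        return y / propensity[t]
    return 0.0


# A point that received treatment "A" with outcome 3.0 (hypothetical values).
propensity = {"A": 0.25, "B": 0.75}
reward_a = ipw_reward("A", "A", 3.0, propensity)
reward_b = ipw_reward("B", "A", 3.0, propensity)

print(reward_a)  # 12.0, the observed outcome scaled up by 1 / 0.25
print(reward_b)  # 0.0, because "B" was not the observed treatment
```

Because treatment "A" was assigned with probability 0.25, the observed outcome is weighted by a factor of four, which shows how a small propensity estimate produces a large reward estimate.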

Doubly Robust

Argument name: :doubly_robust

The doubly robust estimator uses both the predicted probabilities and predicted outcomes to estimate the rewards:

\[\Gamma_i(t) = \begin{cases} f_i(t) + \frac{y_i - f_i(t)}{p_i(t)}, & \text{if } t = T_i\\ f_i(t), & \text{otherwise} \end{cases}\]

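It can be seen as a combination of the direct method and the inverse propensity weighting approaches: it starts from the predicted outcome and, for the observed treatment, adds a propensity-weighted correction based on the prediction error. This estimator has been shown to be accurate as long as at least one of the propensity predictions or the outcome predictions is accurate, hence the name doubly robust.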
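The doubly robust formula can be sketched for a single point as follows. The function name `doubly_robust_reward` and the numbers are illustrative only.

```python
def doubly_robust_reward(t, observed_treatment, y, propensity, predicted):
    """Doubly robust reward Gamma_i(t) for a single point.

    propensity maps each treatment to p_i(t) and predicted maps each
    treatment to f_i(t).
    """
    # Start from the direct method estimate.
    reward = predicted[t]
    if t == observed_treatment:
        # Add the propensity-weighted residual of the outcome model.
        reward += (y - predicted[t]) / propensity[t]
    return reward


# A point that received treatment "A" with outcome 3.0 (hypothetical values).
propensity = {"A": 0.25, "B": 0.75}
predicted = {"A": 2.0, "B": 4.0}

reward_a = doubly_robust_reward("A", "A", 3.0, propensity, predicted)
reward_b = doubly_robust_reward("B", "A", 3.0, propensity, predicted)

print(reward_a)  # 6.0 = 2.0 + (3.0 - 2.0) / 0.25
print(reward_b)  # 4.0, the predicted outcome under "B"
```

If the outcome model were exactly right for the observed treatment, the residual would be zero and the estimate would reduce to the direct method.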

Evaluating Prescriptive Policies Using Rewards

Given a policy $\pi(\mathbf{x})$ that assigns treatments to points, we can evaluate its quality using the estimated rewards in multiple ways.

Note that for observational data, the current policy is given by the treatment assignments. We can evaluate the quality of this policy to gain a standard of performance against which we can compare alternative policies.

Aggregate Policy Reward

The simplest way to evaluate the quality of a policy is by summing the rewards of the assigned treatments for each point:

\[\sum_i \Gamma_i(\pi(\mathbf{x}_i))\]
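As a small illustration, the sum can be computed from a table of estimated rewards and a list of prescriptions. The function name `aggregate_policy_reward` and the reward values are made up for this sketch.

```python
def aggregate_policy_reward(rewards, prescriptions):
    """Sum the estimated reward of the prescribed treatment for each point.

    rewards[i] maps each treatment t to Gamma_i(t), and prescriptions[i]
    is the treatment the policy assigns to point i.
    """
    return sum(r[t] for r, t in zip(rewards, prescriptions))


# Estimated rewards for three points and two treatments (hypothetical).
rewards = [
    {"A": 6.0, "B": 4.0},
    {"A": 1.0, "B": 2.5},
    {"A": 3.0, "B": 0.5},
]

observed_policy = ["A", "B", "A"]   # treatments assigned in the data
candidate_policy = ["A", "B", "B"]  # treatments from an alternative policy

baseline_total = aggregate_policy_reward(rewards, observed_policy)
candidate_total = aggregate_policy_reward(rewards, candidate_policy)

print(baseline_total)   # 11.5
print(candidate_total)  # 9.0
```

Evaluating the observed treatment assignments in the same way gives the baseline against which alternative policies can be compared.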

Comparison to Oracle

If the rewards have been estimated using the direct method, you can also evaluate the performance of the policy by comparing to an oracle that picks the treatment with the best reward for each point. You might consider the following metrics:

  • the proportion of points where the policy prescription matches the oracle prescription
  • an assessment of the difference in rewards under the policy prescription and the oracle prescription, e.g. mean-squared error or mean-absolute error of the difference

You can only evaluate performance by comparing to an oracle if you estimated the rewards using the direct method. The other reward estimation methods only make sense when comparing aggregate policy rewards.
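The two metrics listed above might be computed as in the following sketch. It assumes that a higher reward is better, so the oracle picks the treatment with the largest direct method reward; for problems where lower outcomes are preferred, the oracle would pick the smallest instead. The function name `compare_to_oracle` and the data are hypothetical.

```python
def compare_to_oracle(rewards, prescriptions):
    """Compare policy prescriptions to an oracle on direct method rewards.

    Assumption: higher rewards are better, so the oracle chooses the
    treatment with the maximum reward for each point.
    """
    oracle = [max(r, key=r.get) for r in rewards]
    n = len(rewards)

    # Proportion of points where the policy agrees with the oracle.
    matches = sum(p == o for p, o in zip(prescriptions, oracle))

    # Reward lost at each point by following the policy instead of the oracle.
    gaps = [r[o] - r[p] for r, p, o in zip(rewards, prescriptions, oracle)]

    return {
        "match_rate": matches / n,
        "mean_absolute_gap": sum(abs(g) for g in gaps) / n,
        "mean_squared_gap": sum(g * g for g in gaps) / n,
    }


# Direct method rewards for four points (hypothetical values).
rewards = [
    {"A": 6.0, "B": 4.0},
    {"A": 1.0, "B": 2.5},
    {"A": 3.0, "B": 0.5},
    {"A": 2.0, "B": 5.0},
]
prescriptions = ["A", "B", "B", "A"]

metrics = compare_to_oracle(rewards, prescriptions)
print(metrics)
```

In this toy example the policy agrees with the oracle on two of the four points, and the gaps on the other two points are 2.5 and 3.0.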

Training and Testing Setup

Evaluating prescriptive policies is more complicated than evaluating predictive problems. In particular, we recommend the following steps to ensure that you get a fair and useful out-of-sample evaluation when evaluating prescriptive policies.

Ensure Sufficient Test Data


Make sure you include enough data in the test set to train high-quality reward estimation models.

Unlike a normal predictive model evaluation, reward estimation requires training models on the test set as part of the evaluation. If there is not enough data in the test set, the quality of the reward estimation may suffer, leading to an evaluation of the prescriptive policy that may be inaccurate and unreasonable.

We recommend using much more data for the test set than usual for a prescriptive problem. Allocating 50% of the data for testing is a good starting point.
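A 50% split of this kind could be produced as in the sketch below, which shuffles point indices with a fixed seed. The function name `split_train_test` is hypothetical, and any splitting utility that keeps the two sets disjoint would serve equally well.

```python
import random


def split_train_test(n_points, test_fraction=0.5, seed=0):
    """Randomly split point indices into disjoint training and test sets."""
    indices = list(range(n_points))
    # A seeded generator keeps the split reproducible.
    random.Random(seed).shuffle(indices)
    n_test = round(n_points * test_fraction)
    test = sorted(indices[:n_test])
    train = sorted(indices[n_test:])
    return train, test


train_idx, test_idx = split_train_test(10, test_fraction=0.5, seed=1)
print(len(train_idx), len(test_idx))  # 5 5
```

The test indices are the ones on which the reward estimation models would later be trained and evaluated.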

Proper Separation of Training and Test Sets


When using reward estimation for evaluating prescription policies, you should estimate the rewards using fit_predict! on the test data.

In order to simulate a fair out-of-sample evaluation, we need to prevent any information leaking from the training data into the estimated rewards. This ensures that the rewards used for evaluation are purely a function of the testing data.

In this circumstance, it is important that fit_predict! is used to fit the model and estimate the rewards in a single step, rather than using fit! followed by predict. This is because fit_predict! contains additional measures that reduce overfitting in the reward estimation process.