Reward Estimation with Categorical Treatments

Categorical Reward Estimation Process

When we have categorical treatments, the reward estimation process consists of three steps:

1. Estimating Propensity Scores

The propensity score $p_i(t)$ is the probability that a given point $i$ would be assigned treatment $t$ based on its features $\mathbf{x}_i$.

In a randomized experiment where treatments are assigned to points uniformly at random, the propensity scores would be equal for all treatments and would not depend on the features.

In observational data, there is typically some non-random treatment assignment process that assigns treatments based on the observed features. In these cases, we can estimate the propensity scores by treating the problem as a multi-class classification problem, where we try to predict the assigned treatments using the features. This trained model can then estimate the probabilities of treatment assignment for each point.
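As a rough illustration of the classification view described above, the following Python sketch estimates propensities on a toy dataset with a single discrete feature. It is not the RewardEstimation API: `fit_propensity_model` is a hypothetical helper, and the "classifier" is just a table of treatment frequencies per feature value, standing in for whatever classification learner would be used in practice.

```python
from collections import Counter, defaultdict

def fit_propensity_model(features, treatments):
    """Toy multi-class 'classifier': treatment frequencies per feature value.

    Assumption for illustration only; a real classification learner would
    take the place of this frequency table.
    """
    labels = sorted(set(treatments))
    overall = Counter(treatments)
    per_group = defaultdict(Counter)
    for x, t in zip(features, treatments):
        per_group[x][t] += 1

    def predict_proba(x):
        # Fall back to the overall frequencies for unseen feature values.
        counts = per_group.get(x, overall)
        total = sum(counts.values())
        return {t: counts[t] / total for t in labels}

    return predict_proba

# Toy observational data: treatment assignment depends on the feature.
features = ["young", "young", "young", "old", "old", "old", "old"]
treatments = ["A", "A", "B", "B", "B", "B", "A"]

predict_proba = fit_propensity_model(features, treatments)
# One dictionary of estimated propensities p_i(t) per point.
propensities = [predict_proba(x) for x in features]
```

Each entry of `propensities` holds the estimated probability of every treatment for that point, so the values for a single point sum to one.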

2. Estimating Outcomes

The predicted outcome $f_i(t)$ is the outcome that we would predict to occur if point $i$ with features $\mathbf{x}_i$ were assigned treatment $t$. These are often referred to as counterfactuals or potential outcomes.

We can estimate the outcomes by training a separate model for each treatment to predict the outcomes under that treatment. For a given treatment, we train the model on the subset of the data that received this treatment. We can then apply each of the trained models in order to predict the outcomes under each treatment for each point.

The type of model used for outcome estimation depends on the type of outcome:

  • numeric outcomes are estimated using regression models to predict the outcome directly
  • binary outcomes are estimated using classification models to predict the probability of success
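The per-treatment fitting described above might be sketched as follows for a numeric outcome. This is a minimal Python illustration under stated assumptions, not the library's implementation: `fit_line` and `fit_outcome_models` are hypothetical helpers, and a single-feature least-squares line stands in for the regression learner. For binary outcomes, a classifier returning a success probability would take its place.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def fit_outcome_models(xs, treatments, ys):
    """Train one model per treatment, using only points that received it."""
    models = {}
    for t in sorted(set(treatments)):
        idx = [i for i, t_i in enumerate(treatments) if t_i == t]
        models[t] = fit_line([xs[i] for i in idx], [ys[i] for i in idx])
    return models

# Toy data: under "A" the outcome grows with x, under "B" it is constant.
xs = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
treatments = ["A", "A", "A", "B", "B", "B"]
ys = [2.0, 4.0, 6.0, 5.0, 5.0, 5.0]

models = fit_outcome_models(xs, treatments, ys)
# Apply every model to every point to obtain f_i(t) for all treatments.
predicted = [{t: model(x) for t, model in models.items()} for x in xs]
```

Note that every point receives a prediction under every treatment, including treatments it was never assigned.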

3. Estimating Rewards

Given estimates of the propensity scores and outcomes, there are different approaches for estimating the reward associated with assigning a given observation a given treatment. For more information on these approaches, you can refer to papers in the causal inference literature such as Dudik et al. (2011).


The inverse propensity weighting and doubly robust estimators divide by the predicted propensity as part of calculating the reward, meaning that small propensity predictions can lead to very large or infinite reward predictions. If this occurs, you should use the propensity_min_value parameter to control the minimum propensity value that can be predicted.
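To see why a lower bound on the propensity helps, consider the following small Python sketch. The helper `clip_propensities` is hypothetical and simply raises any value below the bound up to the bound; how the library applies propensity_min_value internally (for example, whether it renormalizes afterwards) is not stated here, so treat this as one simple interpretation.

```python
def clip_propensities(probs, min_value):
    """Raise every propensity below min_value up to min_value.

    Assumption: plain clipping without renormalization, so the clipped
    values may no longer sum to one.
    """
    return {t: max(p, min_value) for t, p in probs.items()}

# A point whose observed treatment "A" was predicted to be very unlikely.
probs = {"A": 0.001, "B": 0.999}
y_i = 10.0

# Dividing by the raw propensity inflates the outcome enormously.
raw_reward = y_i / probs["A"]

# With a minimum propensity of 0.05 the division is bounded.
clipped = clip_propensities(probs, min_value=0.05)
clipped_reward = y_i / clipped["A"]
```

With a bound of `min_value`, no outcome can be scaled up by a factor of more than `1 / min_value`.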

Direct Method

Argument name: :direct_method

The direct method estimator simply uses the predicted outcome under each treatment as the estimated reward:

\[\Gamma_i(t) = f_i(t)\]

The quality of the reward estimation is thus determined solely by the quality of the outcome prediction model. One potential issue with this approach is that the outcome prediction model is trained holistically, rather than being trained with prescription in mind. This means the model might focus attention on predicting outcomes in areas that are irrelevant for comparing the relative quality of treatments.

Inverse Propensity Weighting

Argument name: :inverse_propensity_weighting

The inverse propensity weighting estimator predicts the rewards by adjusting the observed outcomes using estimates of the propensity score, where $y_i$ is the observed outcome and $T_i$ is the observed treatment for point $i$:

\[\Gamma_i(t) = \begin{cases} \frac{y_i}{p_i(t)}, & \text{if } t = T_i\\ 0, & \text{otherwise} \end{cases}\]

This can be interpreted as taking the observed outcomes and correcting for any shift in treatment assignment probabilities between the observed data and the policy being evaluated. The quality of this method is tied to the probability estimation model. In particular, small probability estimates can cause large reward estimates. It may also require a larger number of observations with good coverage of the possible treatments to ensure that the probability estimates are non-zero in most places.
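The formula above can be written directly in Python. This is a minimal sketch rather than the library's code: `ipw_rewards` is a hypothetical function, and the propensities are supplied by hand instead of coming from a fitted model.

```python
def ipw_rewards(y, observed, propensity, treatments):
    """Inverse propensity weighting rewards.

    y[i]          observed outcome y_i
    observed[i]   observed treatment T_i
    propensity[i] dict mapping treatment t to p_i(t)
    """
    rewards = []
    for y_i, t_i, p_i in zip(y, observed, propensity):
        # Only the observed treatment gets a non-zero reward.
        rewards.append(
            {t: (y_i / p_i[t] if t == t_i else 0.0) for t in treatments}
        )
    return rewards

y = [4.0, 6.0]
observed = ["A", "B"]
propensity = [{"A": 0.5, "B": 0.5}, {"A": 0.25, "B": 0.75}]

rewards = ipw_rewards(y, observed, propensity, treatments=["A", "B"])
```

In this example the first point's outcome of 4 is doubled because its observed treatment had propensity 0.5, while the treatment it did not receive is given a reward of zero.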

Doubly Robust

Argument name: :doubly_robust

The doubly robust estimator uses both the predicted probabilities and predicted outcomes to estimate the rewards:

\[\Gamma_i(t) = \begin{cases} f_i(t) + \frac{y_i - f_i(t)}{p_i(t)}, & \text{if } t = T_i\\ f_i(t), & \text{otherwise} \end{cases}\]

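It can be seen as a combination of the direct method and inverse propensity weighting approaches: it starts from the predicted outcome and, for the observed treatment, adds the prediction error weighted by the inverse propensity. This estimator has been shown to be accurate as long as either the propensity predictions or the outcome predictions are accurate, hence the name doubly robust.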
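A minimal Python sketch of the doubly robust formula is given below. As with any illustration of this kind, `doubly_robust_rewards` is a hypothetical function and the propensities and predicted outcomes are supplied by hand.

```python
def doubly_robust_rewards(y, observed, propensity, outcome, treatments):
    """Doubly robust rewards.

    y[i]          observed outcome y_i
    observed[i]   observed treatment T_i
    propensity[i] dict mapping treatment t to p_i(t)
    outcome[i]    dict mapping treatment t to f_i(t)
    """
    rewards = []
    for y_i, t_i, p_i, f_i in zip(y, observed, propensity, outcome):
        row = {}
        for t in treatments:
            # Start from the direct method estimate f_i(t).
            row[t] = f_i[t]
            if t == t_i:
                # Correct by the inverse-propensity-weighted residual.
                row[t] += (y_i - f_i[t]) / p_i[t]
        rewards.append(row)
    return rewards

y = [4.0]
observed = ["A"]
propensity = [{"A": 0.5, "B": 0.5}]
outcome = [{"A": 3.0, "B": 5.0}]

rewards = doubly_robust_rewards(
    y, observed, propensity, outcome, treatments=["A", "B"]
)
```

For the observed treatment "A", the prediction of 3 is corrected by the residual of 1 divided by the propensity 0.5. For "B", the reward is just the predicted outcome. When the outcome model is exact on the observed treatment, the residual is zero and the estimator coincides with the direct method.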

Learner for Categorical Reward Estimation

RewardEstimation provides the CategoricalRewardEstimator learner to easily conduct reward estimation on data with categorical treatments. In addition to the shared learner parameters, the following parameters are used to control the reward estimation procedure.


The learner to use for propensity score estimation. Any ClassificationLearner can be used; some recommended options include:

To conduct parameter validation during the propensity score fitting, a GridSearch over a classification learner can also be used.


A positive Integer indicating the number of folds to use when estimating the propensity scores in-sample during fit_predict!. Defaults to 5. Set to nothing to disable in-sample cross-validation (which should only be done if you are confident the model will not overfit).
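The idea behind in-sample cross-validation can be sketched in Python as follows: each point's propensity is predicted by a model that was trained without that point, which limits the effect of overfitting. Everything here is an assumption made for illustration, including the helper names, the round-robin fold assignment, and the frequency-table model; the library's actual fold construction may differ.

```python
from collections import Counter, defaultdict

def fit_frequency_model(features, treatments, labels):
    """Toy stand-in for a classifier: treatment frequencies per feature value."""
    overall = Counter(treatments)
    per_group = defaultdict(Counter)
    for x, t in zip(features, treatments):
        per_group[x][t] += 1

    def predict_proba(x):
        # Fall back to overall frequencies for unseen feature values.
        counts = per_group.get(x, overall)
        total = sum(counts.values())
        return {t: counts[t] / total for t in labels}

    return predict_proba

def crossfit_propensities(features, treatments, num_folds):
    """Predict each point's propensities with a model fit on the other folds."""
    if num_folds < 2:
        raise ValueError("num_folds must be at least 2")
    n = len(features)
    labels = sorted(set(treatments))
    predictions = [None] * n
    for fold in range(num_folds):
        # Round-robin fold assignment (an assumption for this sketch).
        held_out = [i for i in range(n) if i % num_folds == fold]
        train = [i for i in range(n) if i % num_folds != fold]
        model = fit_frequency_model(
            [features[i] for i in train],
            [treatments[i] for i in train],
            labels,
        )
        for i in held_out:
            predictions[i] = model(features[i])
    return predictions

features = ["u", "u", "u", "u", "v", "v", "v", "v"]
treatments = ["A", "B", "A", "A", "B", "B", "A", "B"]
propensities = crossfit_propensities(features, treatments, num_folds=2)
```

On a dataset this small, some held-out predictions are exactly zero for a treatment that was in fact observed, which is the situation that a minimum propensity value is intended to guard against.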


A Real value between 0 and 1 that specifies the minimum propensity score that can be predicted for any treatment (this is the propensity_min_value parameter referred to earlier). Defaults to 0, which imposes no lower bound.


The learner to use for outcome estimation. The type of learner depends on the type of outcome in the problem:

To conduct parameter validation during the outcome learner fitting, a GridSearch over the appropriate learner can also be used.


A positive Integer indicating the number of folds to use when estimating the outcomes in-sample during fit_predict!. Defaults to 5. Set to nothing to disable in-sample cross-validation (which should only be done if you are confident the model will not overfit).


The method to use for reward estimation. The following options are available: