Reward Estimation with Categorical Treatments
Categorical Reward Estimation Process
When we have categorical treatments, the reward estimation process consists of three steps:
1. Estimating Propensity Scores
The propensity score $p_i(t)$ is the probability that a given point $i$ would be assigned treatment $t$ based on its features $\mathbf{x}_i$.
In a randomized experiment where treatments are assigned to points uniformly at random, the propensity scores would be equal for all treatments.
In observational data, there is typically some non-random treatment assignment process that assigns treatments based on the observed features. In these cases, we can estimate the propensity scores by treating treatment assignment as a multi-class classification problem, in which we predict the assigned treatment from the features. The trained model can then estimate the probability of each treatment assignment for each point.
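To make this concrete, here is a minimal sketch using hypothetical data; the simple group-frequency model stands in for a real multi-class classifier such as RandomForestClassifier, and all variable and helper names are illustrative:

```julia
# Hypothetical observational data: one feature and the assigned treatments
x          = ["young", "young", "old", "old", "young", "old"]
treatments = ["A", "B", "A", "B", "B", "A"]
levels     = unique(treatments)

# Stand-in for a multi-class classifier: estimate P(t | x) as the empirical
# frequency of treatment t among points sharing the same feature value
propensity(xi, t) = count(i -> x[i] == xi && treatments[i] == t, eachindex(x)) /
                    count(==(xi), x)

# n × |T| matrix of propensity scores p_i(t)
p = [propensity(x[i], t) for i in eachindex(x), t in levels]
```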
2. Estimating Outcomes
The predicted outcome $f_i(t)$ is the outcome that we would predict to occur if point $i$ with features $\mathbf{x}_i$ were assigned treatment $t$. These are often referred to as counterfactuals or potential outcomes.
We can estimate the outcomes by training a separate model for each treatment to predict the outcomes under that treatment. For a given treatment, we train the model on the subset of the data that received this treatment. We can then apply each of the trained models to predict the outcomes under every treatment for every point, as shown in the sketch after the list below.
The type of model used for outcome estimation depends on the type of outcome:
- numeric outcomes are estimated using regression models to predict the outcome directly
- binary outcomes are estimated using classification models to predict the probability of success
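Continuing the sketch above, we can fit one outcome model per treatment on the corresponding subset and then apply every model to every point; the group-mean predictor used here is only a stand-in for a regression learner such as RandomForestRegressor:

```julia
# Hypothetical numeric outcomes for the same points
y = [1.0, 2.0, 1.5, 2.5, 3.0, 0.5]

# "Train" one model per treatment: here simply the mean outcome among the
# points that received that treatment (a constant predictor)
mean_outcome(t) = sum(y[treatments .== t]) / count(==(t), treatments)
models = Dict(t => mean_outcome(t) for t in levels)

# Counterfactual predictions f_i(t): apply every treatment's model to every point
f = [models[t] for i in eachindex(y), t in levels]
```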
3. Estimating Rewards
Given estimates of the propensity scores and outcomes, there are different approaches for estimating the reward associated with assigning a given observation a given treatment. For more information on these approaches, you can refer to papers in the causal inference literature such as Dudik et al. (2011).
The inverse propensity weighting and doubly robust estimators divide by the predicted propensity as part of calculating the reward, meaning that small propensity predictions can lead to very large or infinite reward predictions. If this occurs, you should use the propensity_min_value parameter to control the minimum propensity value that can be predicted.
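As an illustration (the clipping shown is an assumed interpretation of this parameter, which is documented in full below), such a floor could be applied to the sketch's propensity matrix as:

```julia
# Illustrative floor on the propensity estimates (assumed behavior of
# propensity_min_value): no predicted propensity can fall below 0.05
propensity_min_value = 0.05
p_floored = max.(p, propensity_min_value)
```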
Direct Method
Argument name: :direct_method
The direct method estimator simply uses the predicted outcome under each treatment as the estimated reward:
\[\Gamma_i(t) = f_i(t)\]
The quality of the reward estimation is thus determined solely by the quality of the outcome prediction model. One potential issue with this approach is that the outcome model is trained to predict outcomes well overall, rather than with the downstream prescription task in mind. This means the model may spend its capacity on predicting outcomes in regions that are irrelevant for comparing the relative quality of treatments.
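In the running sketch, the direct-method rewards are simply the counterfactual predictions computed earlier:

```julia
# Direct method: the reward estimate Γ_i(t) is just the predicted outcome f_i(t)
Γ_dm = copy(f)
```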
Inverse Propensity Weighting
Argument name: :inverse_propensity_weighting
The inverse propensity weighting estimator predicts the rewards by adjusting the observed treatments using estimates of the propensity score:
\[\Gamma_i(t) = \begin{cases} \frac{y_i}{p_i(t)}, & \text{if } t = T_i\\ 0, & \text{otherwise} \end{cases}\]
This can be interpreted as taking the observed outcomes and correcting for any shift in treatment assignment probabilities between the observed data and the policy being evaluated. The quality of this method is tied to the probability estimation model. In particular, small probability estimates can cause large reward estimates. It may also require a larger number of observations with good coverage of the possible treatments to ensure that the probability estimates are non-zero in most places.
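In the running sketch, the inverse propensity weighting rewards could be computed as:

```julia
# IPW reward for point i and treatment column j: the propensity-weighted
# observed outcome if the treatment matches the observed one, zero otherwise
# (p_floored is the floored propensity matrix from above)
ipw(i, j) = treatments[i] == levels[j] ? y[i] / p_floored[i, j] : 0.0

Γ_ipw = [ipw(i, j) for i in eachindex(y), j in eachindex(levels)]
```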
Doubly Robust
Argument name: :doubly_robust
The doubly robust estimator uses both the predicted probabilities and predicted outcomes to estimate the rewards:
\[\Gamma_i(t) = \begin{cases} f_i(t) + \frac{y_i - f_i(t)}{p_i(t)}, & \text{if } t = T_i\\ f_i(t), & \text{otherwise} \end{cases}\]
It can be seen as a combination of the direct method and inverse propensity weighting approaches. This estimator is shown to be accurate as long as either the propensity or the outcome predictions (or both) are accurate, hence the name doubly robust.
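And the doubly robust rewards in the running sketch combine the two sets of predictions:

```julia
# Doubly robust reward for point i and treatment column j: start from the
# direct-method prediction f_i(t) and, for the observed treatment, add the
# propensity-weighted residual (y_i - f_i(t)) / p_i(t)
dr(i, j) = treatments[i] == levels[j] ?
    f[i, j] + (y[i] - f[i, j]) / p_floored[i, j] : f[i, j]

Γ_dr = [dr(i, j) for i in eachindex(y), j in eachindex(levels)]
```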
Learner for Categorical Reward Estimation
RewardEstimation provides the CategoricalRewardEstimator learner to easily conduct reward estimation on data with categorical treatments. In addition to the shared learner parameters, the following parameters are used to control the reward estimation procedure.
propensity_estimator
The learner to use for propensity score estimation. Any ClassificationLearner can be used; some recommended options include:
- RandomForestClassifier or XGBoostClassifier to estimate propensity scores using accurate black-box models
- EqualPropensityEstimator to estimate equal probabilities for each treatment (for data from randomized experiments where treatments are randomly assigned)
To conduct parameter validation during the propensity score fitting, a GridSearch over a classification learner can also be used.
propensity_insample_num_folds
A positive Integer indicating the number of folds to use when estimating the propensity scores in-sample during fit_predict!. Defaults to 5. Set to nothing to disable in-sample cross-validation (which should only be done if you are confident the model will not overfit).
propensity_min_value
A Real value between 0 and 1 specifying the minimum propensity score that can be predicted for any treatment. Defaults to 0.
outcome_estimator
The learner to use for outcome estimation. The type of learner depends on the type of outcome in the problem:
- if the outcomes $y_i$ are numeric, we estimate the outcome directly using a RegressionLearner such as RandomForestRegressor, XGBoostRegressor, or GLMNetCVRegressor
- if the outcomes $y_i$ are binary, we estimate the probability of success using a ClassificationLearner such as RandomForestClassifier or XGBoostClassifier
To conduct parameter validation during the outcome learner fitting, a GridSearch over the appropriate learner can also be used.
outcome_insample_num_folds
A positive Integer indicating the number of folds to use when estimating the outcomes in-sample during fit_predict!. Defaults to 5. Set to nothing to disable in-sample cross-validation (which should only be done if you are confident the model will not overfit).
reward_estimator
The method to use for reward estimation. The following options are available:
- :direct_method to use the direct method estimator
- :inverse_propensity_weighting to use the inverse propensity weighting estimator
- :doubly_robust to use the doubly robust estimator
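As a hypothetical end-to-end example, the parameters above might be combined as follows. The constructor keyword syntax and the fit_predict! argument order shown here are assumptions rather than verified signatures; consult the package reference for the exact API:

```julia
# Sketch only: learner and parameter names follow the documentation above, but
# the exact call signatures are assumed
learner = CategoricalRewardEstimator(
    propensity_estimator=RandomForestClassifier(),
    outcome_estimator=RandomForestRegressor(),
    reward_estimator=:doubly_robust,
    propensity_min_value=0.05,
    propensity_insample_num_folds=5,
    outcome_insample_num_folds=5,
)

# X: a table of features; treatments and y (outcomes) as in the sketches above
rewards = fit_predict!(learner, X, treatments, y)
```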