# Reward Estimation with Categorical Treatments

This page gives an overview of different approaches to reward estimation with categorical treatments. For more information on these approaches, you can refer to papers in the causal inference literature such as Dudik et al. (2011).

In general, the doubly robust method should be preferred, but because it builds upon the direct method and inverse propensity weighting approaches, it is important to understand how these work.

## Approach 1: Direct Method

The *direct method* estimator is a naive approach for reward estimation that simply trains models to predict the outcome under each treatment.

Concretely, we train one model $f_t(\mathbf{x})$ for each treatment $t$ to predict $y$ as a function of $\mathbf{x}$, where each treatment model is trained using the subset of the data that received the corresponding treatment. We then estimate the rewards for each point $i$ under each treatment $t$ as:

\[\Gamma_i(t) = f_t(\mathbf{x}_i)\]
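As an illustrative sketch of this procedure (the dataset layout, function name, and choice of model class here are assumptions for the example, not part of any particular library), the direct method could be implemented as:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def direct_method_rewards(X, T, y, treatments):
    """Train one outcome model per treatment, then predict rewards everywhere.

    X: (n, d) feature matrix, T: (n,) treatment labels, y: (n,) outcomes.
    Returns a dict mapping each treatment t to its estimates Gamma_i(t).
    The linear model is an arbitrary choice for illustration.
    """
    rewards = {}
    for t in treatments:
        mask = (T == t)  # subset of the data that received treatment t
        f_t = LinearRegression().fit(X[mask], y[mask])
        rewards[t] = f_t.predict(X)  # Gamma_i(t) = f_t(x_i) for every point i
    return rewards

# Toy example on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
T = rng.integers(0, 2, size=100)
y = X[:, 0] + 0.5 * T + rng.normal(scale=0.1, size=100)
gammas = direct_method_rewards(X, T, y, treatments=[0, 1])
```

Note that each model is trained only on its own treatment subpopulation, but is then used to predict outcomes for *all* points, which is exactly where the extrapolation problems described below can arise.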

The quality of the reward estimation is thus determined solely by the quality of these outcome prediction models: if these models accurately and consistently predict the outcomes for each treatment based on the features, then the results can be strong. However, there are a number of ways the direct method can go wrong:

- If there is bias in the observed treatment assignments (i.e. $\mathbf{x}$ and $T$ are correlated), then this bias can affect the integrity of the outcome models. For instance, if a treatment is overwhelmingly given to patients of a single ethnicity, the outcome model may not generalize well to the entire population.
- The outcome prediction models are trained holistically, rather than being trained with prescription in mind. This means the models might focus attention on predicting outcomes in areas that are irrelevant for deciding which treatment is best to apply.

## Approach 2: Inverse Propensity Weighting

The *inverse propensity weighting* (IPW) estimator predicts the rewards by adjusting the observed outcomes in the data to account for any bias in treatment assignments.

Concretely, we train a single multi-class classification model $p(\mathbf{x}, t)$ to predict *propensity scores*, which are the probabilities that a given sample would be assigned treatment $t$ based on its features $\mathbf{x}$. We then estimate the rewards for each point $i$ under each treatment $t$ as:

\[\Gamma_i(t) = \mathbb{I}\{T_i = t\} \cdot \frac{1}{p(\mathbf{x}_i, t)} y_i\]

where $\mathbb{I}\{x\}$ denotes the indicator function (1 if $x$ is true and 0 otherwise).

This can be interpreted as reweighting the samples in each treatment subpopulation to correct for any bias in how the treatments were assigned. After this adjustment, each treatment subpopulation should have similar distributions of features, allowing us to more fairly compare outcomes between these groups.
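To make this concrete, here is an illustrative sketch of the IPW estimator (the function name and the use of a logistic regression propensity model are assumptions chosen for the example):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ipw_rewards(X, T, y):
    """Estimate rewards via inverse propensity weighting.

    Gamma_i(t) = 1{T_i = t} * y_i / p(x_i, t), where p is a multi-class
    propensity model. Logistic regression is an arbitrary choice here.
    """
    prop_model = LogisticRegression().fit(X, T)
    # probs[i, j] = estimated P(T_i = classes_[j] | x_i)
    probs = prop_model.predict_proba(X)
    rewards = {}
    for j, t in enumerate(prop_model.classes_):
        indicator = (T == t).astype(float)  # 1{T_i = t}
        rewards[t] = indicator * y / probs[:, j]
    return rewards

# Toy example on synthetic data
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
T = rng.integers(0, 2, size=200)
y = X[:, 0] + 0.5 * T + rng.normal(scale=0.1, size=200)
gammas = ipw_rewards(X, T, y)
```

Notice that for each treatment $t$, only the samples that actually received $t$ contribute nonzero reward estimates; the inverse-probability weight rescales their outcomes to stand in for the full population.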

The quality of the IPW approach depends on the strength of the probability estimation model. If we can estimate the treatment assignment probabilities consistently and accurately, the rewards can be accurate. However, there are a number of ways the IPW approach can go wrong:

- Because we divide by the probability estimates, minor variations in small probability estimates can cause large swings in the reward estimates. For this reason, the IPW estimator can often have high variance and be very sensitive to the particular training data and parameters used. The `propensity_min_value` parameter can often be used to control the degree of such variance.
- If the data does not record certain features that were influential in assigning treatments, we may not be able to accurately estimate the propensity scores, limiting the quality of the reward estimations. This is known as the problem of *unobserved confounders*.
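The idea behind flooring small propensities (which is what `propensity_min_value` controls) can be sketched in isolation. The function name and default below are illustrative, not the actual implementation:

```python
import numpy as np

def clip_propensities(probs, min_value=0.05):
    """Floor small propensity estimates to limit the variance of IPW.

    Dividing by very small probabilities produces extreme weights, so a
    common variance-reduction technique is to clip the estimates from
    below, trading a small amount of bias for much lower variance.
    """
    return np.clip(probs, min_value, 1.0)

probs = np.array([0.001, 0.3, 0.9])
clipped = clip_propensities(probs)  # the 0.001 estimate becomes 0.05
```

A sample with a raw propensity of 0.001 would receive an IPW weight of 1000; after clipping at 0.05, its weight is capped at 20, greatly reducing the estimator's sensitivity to that one point.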

## Approach 3: Doubly Robust

The *doubly robust* estimator is a natural combination of the direct method and inverse propensity weighting approaches, drawing on their respective strengths and covering their respective weaknesses.

Concretely, we train models as described earlier for both approaches:

- one model $f_t(\mathbf{x})$ for each treatment $t$ to predict $y$ as a function of $\mathbf{x}$ under this treatment
- one classification model $p(\mathbf{x}, t)$ to predict the probability that a sample is assigned treatment $t$ based on its features $\mathbf{x}$

We then combine the estimates from these models to estimate the rewards:

\[\Gamma_i(t) = f_t(\mathbf{x}_i) + \mathbb{I}\{T_i = t\} \cdot \frac{1}{p(\mathbf{x}_i, t)} (y_i - f_t(\mathbf{x}_i))\]

This can be seen as a combination of the direct method and the inverse propensity weighting approaches, where the direct method estimate is subsequently adjusted to account for any treatment assignment bias present in the observed data.
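As an illustrative sketch combining the two components (again, the function name and choices of linear and logistic models are assumptions for the example):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

def doubly_robust_rewards(X, T, y):
    """Doubly robust reward estimates combining outcome and propensity models.

    Gamma_i(t) = f_t(x_i) + 1{T_i = t} * (y_i - f_t(x_i)) / p(x_i, t)
    Model classes are arbitrary choices for illustration.
    """
    prop_model = LogisticRegression().fit(X, T)
    probs = prop_model.predict_proba(X)
    rewards = {}
    for j, t in enumerate(prop_model.classes_):
        mask = (T == t)
        f_t = LinearRegression().fit(X[mask], y[mask])
        direct = f_t.predict(X)  # direct-method component f_t(x_i)
        # IPW-style correction using the residual y_i - f_t(x_i)
        correction = mask.astype(float) * (y - direct) / probs[:, j]
        rewards[t] = direct + correction
    return rewards

# Toy example on synthetic data
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
T = rng.integers(0, 2, size=200)
y = X[:, 0] + 0.5 * T + rng.normal(scale=0.1, size=200)
gammas = doubly_robust_rewards(X, T, y)
```

If the outcome model $f_t$ is accurate, the residuals in the correction term are near zero and any propensity error has little effect; if instead the propensity model is accurate, the correction term repairs the bias in the direct estimates, which is the source of the double robustness described below.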

This estimator is shown to be accurate as long as at least one of the propensity model and the outcome models is accurate, hence the name *doubly robust*. In this way, it combines the strengths of the earlier estimators and mitigates their weaknesses.

## Sidenote: Survival Outcomes

When conducting reward estimation for problems with survival outcomes, the presence of censoring in the data can cause additional complications for the IPW and doubly robust approaches. This is because we do not observe the true survival time for censored observations, only a lower bound, and thus we do not have all of the values $y_i$ that are needed to calculate $\Gamma_i(t)$.

To account for this, in survival problems the reward estimation pipeline is automatically adjusted to account for the presence of censoring in the data, following approaches presented in Cui et al. (2020). This behavior is controlled via the `censoring_adjustment_method` parameter.