# Reward Estimation with Numeric Treatments

This page gives an overview of different approaches to reward estimation with numeric treatments. These approaches are extensions of the corresponding categorical treatments to handle numeric treatments, so we recommend reviewing the categorical-treatment approaches prior to reading this page.

The literature on reward estimation for numeric treatments is much more recent and less established than that for categorical treatments, and several different flavors of numeric-treatment reward estimation have been proposed.

As with categorical treatments, the doubly robust method should typically be preferred, but it again builds upon the direct method and inverse propensity weighting approaches, so it is also important to understand how these work.

All approaches to reward estimation with numeric treatments require us to specify a set of *candidate treatments* for which we want to generate rewards. These candidates can be chosen however we like (options include the most common treatments observed in the data, or treatments evenly spaced across the range of possible values). However, there are some considerations to take into account:

- we need to have sufficient supporting data to estimate rewards for a given candidate treatment, so it may not be possible to generate high-quality reward estimates for treatment candidates that are far from frequently observed treatments
- each treatment candidate requires separate estimation models, so increasing the number of candidates increases the amount of work required for reward estimation
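To make these options concrete, the sketch below generates candidates either evenly spaced across the observed range or at empirical quantiles, which keeps candidates near frequently observed treatments. The data here is hypothetical and synthetic; all names and values are illustrative.

```python
import numpy as np

# Hypothetical observed numeric treatments (e.g. a dose); illustrative only.
rng = np.random.default_rng(0)
observed = rng.normal(loc=50.0, scale=10.0, size=1000)

# Option 1: candidates evenly spaced across the range of observed treatments.
evenly_spaced = np.linspace(observed.min(), observed.max(), num=5)

# Option 2: candidates at empirical quantiles, which places them close to
# frequently observed treatments (more supporting data per candidate).
quantile_based = np.quantile(observed, [0.1, 0.3, 0.5, 0.7, 0.9])

print(evenly_spaced)
print(quantile_based)
```

Note the trade-off from the list above: the quantile-based candidates avoid poorly supported extremes, while the evenly spaced ones cover the full range but may include treatments with little nearby data.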

## Approach 1: Direct Method

The *direct method* estimator is a naive approach for reward estimation that simply trains models to predict the outcome under each treatment candidate.

Similarly to the direct method for categorical treatments, we train one model $f_t(\mathbf{x})$ for each treatment candidate $t$ to predict $y$ as a function of $\mathbf{x}$. We then estimate the rewards for each point $i$ under each treatment candidate $t$ as:

\[\Gamma_i(t) = f_t(\mathbf{x}_i)\]

The key difference from the approach for categorical treatments is how we select the data used to train the model for each treatment candidate. For categorical treatments, we simply trained on the subset of the data that received each treatment. For numeric treatments, however, treatments can be close without being exactly identical, so subsetting based on exact equality is likely to throw out more data than necessary.

Instead, we use a kernel function to measure similarity between treatments in order to determine the subset of data to use for training. We measure similarity between two treatments $t_1$ and $t_2$ using the following function:

\[K_h(t_1 - t_2) = \frac{K(\frac{t_1 - t_2}{h})}{h}\]

where $K(u)$ is a kernel function, and $h$ is the *bandwidth*, controlling how the similarity scales as the distance between $t_1$ and $t_2$ increases.
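As an illustration, here is a sketch of this similarity computation using the Gaussian kernel, one common choice for $K$ (both the kernel and the bandwidth are modelling choices, not prescribed by the method):

```python
import numpy as np

def gaussian_kernel(u):
    # Standard Gaussian kernel K(u) = exp(-u^2 / 2) / sqrt(2*pi).
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def kernel_similarity(t1, t2, h):
    # K_h(t1 - t2) = K((t1 - t2) / h) / h, as defined above.
    return gaussian_kernel((t1 - t2) / h) / h

# For a fixed bandwidth, similarity decreases as treatments move apart.
for dist in [0.0, 1.0, 2.0]:
    print(dist, kernel_similarity(5.0, 5.0 - dist, h=1.0))
```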

For each treatment candidate $t$, we measure the similarity between $t$ and the observed treatment $T_i$ for each sample, and use these similarities as sample weights when training the outcome model $f_t(\mathbf{x})$:

\[w_i = K_h(T_i - t)\]

In this way, the outcome model for each treatment candidate will give more weight to outcomes for samples that received treatments closer to the candidate, and less or no weight to the outcomes for samples that received very different treatments.
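The sketch below illustrates the full direct-method recipe on hypothetical synthetic data, using a Gaussian kernel for the sample weights and weighted least squares as a stand-in for the outcome model $f_t(\mathbf{x})$; any regressor that supports sample weights could be used instead.

```python
import numpy as np

# Hypothetical synthetic data: one feature x, numeric treatment T, outcome y.
rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=n)
T = 50.0 + 10.0 * rng.normal(size=n)
y = 2.0 * x + 0.1 * T + rng.normal(scale=0.5, size=n)

def kernel_weights(T_obs, t, h):
    # Gaussian-kernel similarity K_h(T_i - t); the kernel choice is ours.
    u = (T_obs - t) / h
    return np.exp(-0.5 * u**2) / (h * np.sqrt(2.0 * np.pi))

def fit_weighted_linear(X, y, w):
    # Weighted least squares: a simple stand-in for any outcome model
    # that supports sample weights.
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return coef

X = np.column_stack([np.ones(n), x])  # intercept + feature
candidates = [40.0, 50.0, 60.0]

rewards = {}
for t in candidates:
    w = kernel_weights(T, t, h=5.0)      # samples with T near t get more weight
    coef = fit_weighted_linear(X, y, w)  # outcome model f_t
    rewards[t] = X @ coef                # Gamma_i(t) = f_t(x_i)
```

In this synthetic example the outcome increases with the treatment, so the estimated rewards should increase across the three candidates.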

## Approach 2: Inverse Propensity Weighting

The *inverse propensity weighting* (IPW) estimator predicts the rewards by adjusting the observed outcomes in the data to account for any bias in treatment assignments.

Similarly to the IPW estimator for categorical treatments, we predict *propensity scores*, which are the probabilities that a given sample would be assigned each candidate treatment $t$ based on its features $\mathbf{x}$. We then estimate the rewards for each point $i$ under each treatment $t$ as:

\[\Gamma_i(t) = K_h(T_i - t) \cdot \frac{1}{p_t(\mathbf{x}_i)} y_i\]

The first difference from the approach for categorical treatments is that we have replaced the indicator function $\mathbb{1}\{T_i = t\}$ with a kernel similarity function $K_h(T_i - t)$ to adapt the adjustment term to numeric values.

The other difference is that we cannot use a multi-class classification model to predict the treatment assignment probabilities, as the treatments are now numeric values rather than discrete classes.

Instead, we train a single regression model $p_t(\mathbf{x})$ per treatment candidate $t$ to estimate the probability of receiving that treatment as a function of $\mathbf{x}$. When constructing the target for training these models, we do not simply use an indicator function $\mathbb{1}\{T_i = t\}$: this could discard more data than necessary, and it does not respect the similarity of treatments that are close but not identical. As in the direct method, we instead use similarity as measured by a kernel function to construct the target, and train the model to predict $q$ as a function of $\mathbf{x}$, where

\[q_i = K_h(T_i - t)\]

In this way, the propensity model for each treatment candidate learns to predict the similarity between the observed treatments and the candidate as a function of the features.

The kernel $K_h(T_i - t)$ used to construct the propensity target does not have to be the same as the kernel used in the reward calculation; in fact, it is common for the propensity estimation step and the reward calculation step to use different kernels.
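Putting the pieces together, here is a sketch of the IPW estimator on hypothetical synthetic data, with plain least squares as a stand-in for the propensity model $p_t(\mathbf{x})$. Clipping the predicted propensities away from zero is a common practical stabilization, not part of the estimator as stated above.

```python
import numpy as np

# Hypothetical synthetic data with confounding: the treatment T depends on
# the feature x, and the outcome y depends on both.
rng = np.random.default_rng(2)
n = 500
x = rng.normal(size=n)
T = 50.0 + 5.0 * x + 5.0 * rng.normal(size=n)
y = 2.0 * x + 0.1 * T + rng.normal(scale=0.5, size=n)

def kernel_weights(T_obs, t, h):
    # Gaussian-kernel similarity K_h(T_i - t).
    u = (T_obs - t) / h
    return np.exp(-0.5 * u**2) / (h * np.sqrt(2.0 * np.pi))

X = np.column_stack([np.ones(n), x])  # intercept + feature

def ipw_rewards(t, h=5.0, clip=1e-2):
    q = kernel_weights(T, t, h)                   # regression target q_i = K_h(T_i - t)
    coef, *_ = np.linalg.lstsq(X, q, rcond=None)  # propensity model p_t
    p = np.clip(X @ coef, clip, None)             # clip away from zero for stability
    return q / p * y                              # Gamma_i(t) = K_h(T_i - t) * y_i / p_t(x_i)

gamma = ipw_rewards(50.0)
```

Here the same kernel is used for the target and the reward adjustment for brevity, but as noted above the two steps may use different kernels.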

## Approach 3: Doubly Robust

The *doubly robust* estimator is a natural combination of the direct method and inverse propensity weighting approaches that combines their respective strengths and covers their respective weaknesses.

Similarly to the doubly robust estimator for categorical treatments, we train models as described earlier for both approaches:

- one model $f_t(\mathbf{x})$ for each candidate treatment $t$ to predict $y$ as a function of $\mathbf{x}$ under this treatment
- one regression model $p_t(\mathbf{x})$ for each candidate treatment $t$ to predict the probability that a sample is assigned this treatment based on its features $\mathbf{x}$

We then combine the estimates from these models to estimate the rewards:

\[\Gamma_i(t) = f_t(\mathbf{x}_i) + K_h(T_i - t) \cdot \frac{1}{p_t(\mathbf{x}_i)} (y_i - f_t(\mathbf{x}_i))\]

Similarly to the IPW estimator for numeric treatments, we have adapted the categorical-treatment doubly robust approach to handle numeric treatments by replacing the indicator function $\mathbb{1}\{T_i = t\}$ with a kernel similarity function $K_h(T_i - t)$.

As with categorical treatments, this estimator can be shown to be accurate as long as at least one of the propensity or outcome predictions is accurate, hence the name *doubly robust*. In this way, it combines the strengths of the earlier estimators and mitigates their weaknesses.
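As a final sketch, the doubly robust estimator can be assembled from the two building blocks above on hypothetical synthetic data. Weighted least squares stands in for both the outcome and propensity models, and the propensities are clipped away from zero for numerical stability; both are illustrative choices, not part of the estimator itself.

```python
import numpy as np

# Hypothetical synthetic data with confounded treatment assignment.
rng = np.random.default_rng(3)
n = 500
x = rng.normal(size=n)
T = 50.0 + 5.0 * x + 5.0 * rng.normal(size=n)
y = 2.0 * x + 0.1 * T + rng.normal(scale=0.5, size=n)

def kernel_weights(T_obs, t, h):
    # Gaussian-kernel similarity K_h(T_i - t).
    u = (T_obs - t) / h
    return np.exp(-0.5 * u**2) / (h * np.sqrt(2.0 * np.pi))

def wls_predict(X, target, w=None):
    # (Weighted) least squares: a stand-in for any regression model.
    w = np.ones(len(target)) if w is None else w
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(X * sw[:, None], target * sw, rcond=None)
    return X @ coef

X = np.column_stack([np.ones(n), x])

def doubly_robust_rewards(t, h=5.0, clip=1e-2):
    k = kernel_weights(T, t, h)
    f = wls_predict(X, y, w=k)                  # outcome model f_t, kernel-weighted
    p = np.clip(wls_predict(X, k), clip, None)  # propensity model p_t, target q_i = K_h(T_i - t)
    # Gamma_i(t) = f_t(x_i) + K_h(T_i - t) / p_t(x_i) * (y_i - f_t(x_i))
    return f + k / p * (y - f)

gamma = doubly_robust_rewards(50.0)
```

The correction term vanishes wherever the outcome model fits well, so the estimate falls back on $f_t(\mathbf{x}_i)$; where it fits poorly, the kernel-weighted inverse-propensity term adjusts the residual, mirroring the doubly robust structure for categorical treatments.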