Tips and Tricks

This page contains a number of best practices and suggestions for getting high-quality estimates when conducting reward estimation.

Training/Testing Setup


When estimating rewards for training Optimal Policy Trees, you should train and estimate the rewards on the training data in a single step using fit_predict!.

When using reward estimation for evaluating prescription policies, you should train and estimate the rewards on the test data in a single step using fit_predict!.

In order to simulate a fair out-of-sample evaluation, we need to prevent any information leaking from the training data into the estimated rewards. This ensures that the rewards used for evaluation are purely a function of the testing data.

For this reason, it is important that the rewards for the training and testing sets are estimated using a learner that is trained separately for each dataset. This way, no data leaks from training to testing, or vice-versa.

Ensure Sufficient Test Data


Make sure you include enough data in the test set to train high-quality reward estimation models.

Unlike a normal predictive model evaluation, reward estimation requires training models on the test set as part of the evaluation. If there is not enough data in the test set, the quality of the reward estimation may suffer, making the resulting evaluation of the prescriptive policy inaccurate.

We recommend using much more data for the test set than usual for a prescriptive problem. Allocating 50% of the data for testing is a good starting point.
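As a rough illustration of this setup, the sketch below splits a toy dataset 50/50 and fits a separate stand-in estimator on each half. It is plain Python and does not use the fit_predict! API: split_indices and mean_outcome_by_treatment are hypothetical helpers, and the per-treatment mean is only a placeholder for a real reward estimator.

```python
import random


def split_indices(n, test_fraction=0.5, seed=0):
    """Shuffle 0..n-1 and split into disjoint (train, test) index lists."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    n_test = int(round(n * test_fraction))
    return sorted(idx[n_test:]), sorted(idx[:n_test])


def mean_outcome_by_treatment(treatments, outcomes):
    """Toy stand-in for a reward estimator: mean outcome per treatment."""
    totals, counts = {}, {}
    for t, y in zip(treatments, outcomes):
        totals[t] = totals.get(t, 0.0) + y
        counts[t] = counts.get(t, 0) + 1
    return {t: totals[t] / counts[t] for t in totals}


# Toy data: 20 points, two treatments.
treatments = ["A", "B"] * 10
outcomes = [float(i) for i in range(20)]

train_idx, test_idx = split_indices(len(outcomes), test_fraction=0.5)

# One estimator per dataset: neither model ever sees the other set's rows.
train_model = mean_outcome_by_treatment(
    [treatments[i] for i in train_idx], [outcomes[i] for i in train_idx]
)
test_model = mean_outcome_by_treatment(
    [treatments[i] for i in test_idx], [outcomes[i] for i in test_idx]
)
```

The point of the sketch is only the data flow: the test-set rewards depend on the test rows alone, and the training-set rewards on the training rows alone.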


We need to make sure that reward estimates used for both training and testing are not in-sample predictions, because as with any model, the reward estimator may overfit to the data, resulting in overly-optimistic reward estimates.

To avoid this, reward estimators use an internal in-sample cross-fitting process when estimating the rewards. This means that the reward for each point is estimated using models that never saw this point in training, mitigating the risk of overfitting.

This cross-fitting is enabled by default, and the number of folds used during the cross-fitting process is controlled by the learner parameters :propensity_insample_num_folds and :outcome_insample_num_folds. Cross-fitting should only be disabled (with extreme caution) if you know the corresponding model will not overfit the data.
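To make the idea of cross-fitting concrete, here is a minimal Python sketch, assuming a trivial "model" that predicts the mean of its training outcomes. The function cross_fit_means and its num_folds argument are hypothetical and are not part of the library. The real estimators fit propensity and outcome models per fold, but the out-of-fold principle is the same.

```python
def cross_fit_means(outcomes, num_folds=5):
    """Predict each point with the mean of all points in the *other* folds."""
    n = len(outcomes)
    if not 2 <= num_folds <= n:
        raise ValueError("num_folds must be between 2 and the number of points")

    # Assign points to folds in round-robin order.
    fold_of = [i % num_folds for i in range(n)]
    preds = [0.0] * n
    for k in range(num_folds):
        # "Train" on every point outside fold k.
        train = [outcomes[i] for i in range(n) if fold_of[i] != k]
        fold_mean = sum(train) / len(train)
        # Predict only for the held-out points in fold k.
        for i in range(n):
            if fold_of[i] == k:
                preds[i] = fold_mean
    return preds


outcomes = [1.0, 2.0, 3.0, 4.0, 100.0, 6.0]
preds = cross_fit_means(outcomes, num_folds=3)
in_sample_mean = sum(outcomes) / len(outcomes)
```

In this toy data the outlier at index 4 has no influence on its own prediction, whereas it would pull an in-sample mean strongly towards itself.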

Judging the Quality of Reward Estimation


Use the score(s) returned by fit_predict! to gauge the quality of the reward estimation process and to compare the relative quality of different internal estimators.

The quality of the policy evaluation or model training based on estimated rewards is only as good as the reward estimation process. When using fit_predict!, the score(s) of the internal estimators are provided as the second return value. These scores should be used to judge whether the reward estimation process is reliable enough to serve as the basis for training or evaluation. If the scores are too low, you can try conducting more parameter validation in the internal estimators, or try a different class of estimation method altogether.

Evaluating Prescriptive Policies Using Rewards

After conducting reward estimation, we can use the rewards for policy evaluation. Given a policy $\pi(\mathbf{x})$ that assigns treatments to points, we should evaluate the quality of a policy by summing the rewards of the assigned treatments for each point:

\[\sum_i \Gamma_i(\pi(\mathbf{x}_i))\]
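The sum above can be sketched in a few lines of Python. This is an illustration only: the rewards are made-up numbers stored as one dictionary per point, and policy and policy_value are hypothetical names.

```python
def policy_value(rewards, prescriptions):
    """Sum the estimated reward of the prescribed treatment for each point."""
    return sum(row[t] for row, t in zip(rewards, prescriptions))


def policy(x):
    """Toy policy: prescribe "A" for small feature values, "B" otherwise."""
    return "A" if x < 0.5 else "B"


# rewards[i][t] plays the role of Gamma_i(t) for point i and treatment t.
rewards = [
    {"A": 1.0, "B": 3.0},
    {"A": 2.0, "B": 0.5},
    {"A": 0.0, "B": 4.0},
]
features = [0.2, 0.9, 0.7]

prescriptions = [policy(x) for x in features]
total = policy_value(rewards, prescriptions)
```

Here the policy prescribes A, B, B, so the total is 1.0 + 0.5 + 4.0 = 5.5.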

If, and only if, the direct method is used for reward estimation, it is also possible to evaluate the performance of the policy by comparing to an oracle that picks the treatment with the best reward for each point. You might consider the following metrics:

  • the proportion of points where the policy prescription matches the oracle prescription
  • an assessment of the difference in rewards under the policy prescription and the oracle prescription, e.g. mean-squared error or mean-absolute error of the difference

You can only compare performance to an oracle if you estimated the rewards using the direct method. With the other reward estimation methods, only the aggregate policy values are meaningful to compare.
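The oracle-based metrics might be computed as in the following Python sketch. It assumes that higher rewards are better and that the rewards came from the direct method. The reward values and the helper names are hypothetical.

```python
def oracle_prescriptions(rewards):
    """For each point, pick the treatment with the highest estimated reward."""
    return [max(row, key=row.get) for row in rewards]


def match_rate(policy_rx, oracle_rx):
    """Proportion of points where the policy agrees with the oracle."""
    return sum(p == o for p, o in zip(policy_rx, oracle_rx)) / len(oracle_rx)


def reward_gap_errors(rewards, policy_rx, oracle_rx):
    """Mean-absolute and mean-squared gap between oracle and policy rewards."""
    gaps = [row[o] - row[p] for row, p, o in zip(rewards, policy_rx, oracle_rx)]
    mae = sum(abs(g) for g in gaps) / len(gaps)
    mse = sum(g * g for g in gaps) / len(gaps)
    return mae, mse


# Made-up direct-method rewards for three points and two treatments.
rewards = [
    {"A": 1.0, "B": 3.0},
    {"A": 2.0, "B": 0.5},
    {"A": 0.0, "B": 4.0},
]
policy_rx = ["A", "B", "B"]

oracle_rx = oracle_prescriptions(rewards)
rate = match_rate(policy_rx, oracle_rx)
mae, mse = reward_gap_errors(rewards, policy_rx, oracle_rx)
```

If lower outcomes are preferable in your problem, the oracle would pick the minimum instead of the maximum.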

Clipping Propensity Scores


Small propensity predictions can lead to very large or infinite reward predictions with high variance. If this occurs, you should use the propensity_min_value parameter to control the minimum propensity value that can be predicted.

The inverse propensity weighting and doubly robust estimators divide by the predicted propensity as part of calculating the reward. This means that the estimated rewards can be highly sensitive to small variations in the underlying propensity predictions, and can become very large when the predicted propensity is close to zero.

The propensity_min_value parameter can be used to mitigate these large, high-variance values by replacing every propensity prediction below this value with the value itself. For example, if it is set to 0.01, any propensity below 0.01 is clipped to 0.01, limiting the extreme effect that smaller values would otherwise have.
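The effect of clipping can be seen in a small Python sketch. It assumes the simplest inverse-propensity form, outcome divided by propensity for the observed treatment. The functions clip_propensity and ipw_reward are hypothetical and only mimic what the propensity_min_value parameter is described as doing.

```python
def clip_propensity(propensity, min_value):
    """Replace any propensity below min_value with min_value."""
    return max(propensity, min_value)


def ipw_reward(outcome, propensity, min_value=0.0):
    """Inverse-propensity-weighted reward for the observed treatment."""
    return outcome / clip_propensity(propensity, min_value)


outcome = 1.0
tiny_propensity = 0.0005

unclipped = ipw_reward(outcome, tiny_propensity)               # about 2000
clipped = ipw_reward(outcome, tiny_propensity, min_value=0.01)  # about 100

# Propensities already above the threshold are left unchanged.
unaffected = ipw_reward(outcome, 0.25, min_value=0.01)
```

Clipping trades a small amount of bias for a large reduction in variance, so the threshold is best kept small.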

Tuning the Kernel Bandwidths

Care should be taken when tuning estimation_kernel_bandwidth and reward_kernel_bandwidth, as it is hard to assess the impact of these parameters on the quality of the final reward estimates.

We present a simple heuristic for tuning these values, but this is only general guidance and it is important to pay attention to the impact of these parameters on each individual problem.

1. Tune estimation_kernel_bandwidth

First, we start by tuning estimation_kernel_bandwidth according to the guidance offered by the scores returned by fit_predict!. This can be done by manually changing the parameter and observing the influence on the scores.

If the bandwidth is too small, we often see poor performance in both the propensity and outcome models, because too little data is included in the estimation problem at a given treatment candidate. It is therefore important to make sure the bandwidth is large enough that every problem has a sufficient amount of data. This can be verified using get_estimation_densities, which returns the total treatment density around each candidate for the current bandwidth. This density is roughly equivalent to the number of points being considered in each problem. If any candidate has a very low density, we may need to increase the bandwidth and/or remove this candidate due to lack of data.
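To build intuition for what a "treatment density around a candidate" means, the Python sketch below sums kernel weights of the observed treatments around each candidate. The Gaussian kernel is an assumption made for illustration, and kernel_density is a hypothetical helper. This is not necessarily how get_estimation_densities computes its result.

```python
import math


def kernel_density(observed, candidate, bandwidth):
    """Sum of Gaussian kernel weights of observed treatments around a candidate.

    Assumption: an unnormalized Gaussian kernel, so a point exactly at the
    candidate contributes 1 and distant points contribute almost nothing.
    """
    return sum(
        math.exp(-0.5 * ((t - candidate) / bandwidth) ** 2) for t in observed
    )


observed = [1.0, 1.1, 0.9, 1.2, 5.0]
candidates = [1.0, 3.0, 5.0]

narrow = {c: kernel_density(observed, c, bandwidth=0.5) for c in candidates}
wide = {c: kernel_density(observed, c, bandwidth=2.0) for c in candidates}
```

With the narrow bandwidth, the candidate at 3.0 has almost no nearby data, which is the situation where we would widen the bandwidth or drop the candidate.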

Once the bandwidth gives sufficient data density in each problem, we expect the following trends:

  • the scores for propensities will increase as we increase the bandwidth, as including additional data from distant treatments makes it easier to predict which observations are unlikely to receive the candidate treatment
  • the scores for outcomes will decrease as we increase the bandwidth, as including additional data from distant treatments is not useful for predicting the outcome under the candidate treatment

This means we can observe and evaluate the tradeoff between the scores as a way to tune a good value for the bandwidth. Typically the results are not incredibly sensitive to the particular value of bandwidth chosen, as long as the scores for propensity and outcomes are reasonable.

2. Tune reward_kernel_bandwidth

Once a value for estimation_kernel_bandwidth has been chosen, we can tune the value for reward_kernel_bandwidth with the help of tune_reward_kernel_bandwidth.

This procedure requires a starting value for the bandwidth, and then uses the trained learner to estimate the reward kernel bandwidth that minimizes the mean-squared error of the reward estimates. Because this process depends on the starting value, we recommend passing a range of starting values to tune_reward_kernel_bandwidth and inspecting the results. The typical pattern we observe is that the tuned value initially increases with the starting value, before peaking and subsequently flattening out.

Given these results, we then recommend taking a value around the peak and then shrinking it to obtain the final value. For instance, if the peak value was around 2.5, we might shrink it to use a final value between 1.5 and 2.0 (the reason for this shrinkage is to try to reduce the bias in the final reward estimates, at the expense of a small increase in variance).
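The peak-then-shrink heuristic can be written down as a short Python sketch. The tuned values below are made-up stand-ins for the results of running the tuning over a range of starting values, and the shrink factor of 0.7 is an assumption chosen to land inside the 1.5 to 2.0 range from the example above. shrink_peak is a hypothetical helper.

```python
def shrink_peak(tuned_values, shrink_factor=0.7):
    """Return the peak tuned bandwidth and a shrunken final value."""
    if not 0.0 < shrink_factor <= 1.0:
        raise ValueError("shrink_factor must be in (0, 1]")
    peak = max(tuned_values)
    return peak, peak * shrink_factor


# Hypothetical tuned bandwidths for increasing starting values:
# rising at first, peaking, then flattening out.
tuned_values = [0.8, 1.6, 2.5, 2.4, 2.4]

peak, final_bandwidth = shrink_peak(tuned_values)
```

In practice the choice of final value is a judgment call, so it is worth inspecting the tuned values directly before settling on one.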

As with the estimation bandwidth, it is also possible to get poor results if the reward bandwidth is too small. We can use the insights from tuning the estimation bandwidth as guidance for when the reward bandwidth might be problematically small.

It may be the case that the results of the tuning do not seem to peak or plateau, or that they give values significantly different from the tuned estimation bandwidth. In this scenario, another option is to use a reward bandwidth close to the tuned estimation bandwidth.

3. Use tuned reward_kernel_bandwidth

Once we have selected a final value for reward_kernel_bandwidth, we can use it in the learner and generate reward predictions without retraining using set_reward_kernel_bandwidth!.