Scoring Criteria

Various scoring criteria are available for training, tuning, and evaluating models. Specify the criterion to use by passing the indicated Symbol as the criterion argument to the relevant functions.

When evaluating performance using the score function, the results are scaled so that a larger score is better, the baseline model receives a score of 0, and a perfect model would receive a score of 1. In this sense, the results from score are similar to $R^2$ for regression problems, where 0 indicates the model is no better than predicting the mean, and 1 is a perfect prediction for each point.
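
For a loss-based criterion, one natural scaling with these properties (shown here as an illustrative assumption rather than a statement of the exact internal calculation) is

\[\text{score} = 1 - \frac{\text{loss}(\text{model})}{\text{loss}(\text{baseline})}\]

which gives 0 when the model's loss matches the baseline loss and 1 when the model's loss is zero.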

In the definitions below, $w_i$ is the sample weight assigned to the $i$th point (where $i = 1, \ldots, n$).

The available criteria depend on the problem type.

Classification

The classification criteria use the following notation:

  • $K$ is the number of class labels
  • $f_i$ is the predicted label for the $i$th point (assumed to be a value between 1 and $K$)
  • $p_{ik}$ is the predicted probability that the $i$th point takes the $k$th label
  • $y_i$ is the true label for the $i$th point

Misclassification

Argument name: :misclassification

Misclassification calculates the (weighted) proportion of points that are misclassified:

\[\frac{\sum_{i = 1}^n w_i \cdot \mathbf{1}(f_i \neq y_i)}{\sum_{i = 1}^n w_i}\]

where $\mathbf{1}(x)$ denotes the indicator function (1 if $x$ is true and 0 otherwise).
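
As a concrete illustration, a minimal sketch of this calculation (the function name and signature are hypothetical, not part of the library API) could look like:

```julia
# Weighted misclassification rate for predicted labels `f`, true labels `y`,
# and sample weights `w`.
misclassification(y, f, w) =
    sum(w[i] * (f[i] != y[i]) for i in eachindex(y)) / sum(w)

misclassification([1, 2, 2, 1], [1, 2, 1, 1], [1.0, 1.0, 1.0, 1.0])  # 0.25
```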

Gini Impurity

Argument name: :gini

The Gini impurity measures how well the fitted probabilities match the empirical probability distribution:

\[\sum_{i = 1}^n w_i (1 - p_{iy_i})\]

Note: the Gini impurity is equivalent to treating the output as a continuous value and calculating the mean-squared error; it is also equivalent to the Brier score. In particular, the Brier score decomposes into refinement and calibration components:

  • refinement is related to the ordering of the predictions (similar to AUC)
  • calibration relates to the reliability of the predicted probabilities

This means the Gini impurity can often be a better measure of performance than AUC, which only evaluates the ordering of the predictions. Additionally, many methods cannot train with AUC as the objective function but can use the Gini impurity, so if AUC is the final metric of interest, using Gini as the training criterion often improves results.
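
For reference, a minimal sketch of the formula above (illustrative only; `p` is an n-by-K matrix of predicted probabilities and `y` contains labels in 1:K):

```julia
# Weighted Gini impurity: sum of w_i * (1 - probability assigned to the true label).
gini_impurity(y, p, w) = sum(w[i] * (1 - p[i, y[i]]) for i in eachindex(y))
```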

Entropy

Argument name: :entropy

The entropy criterion measures the goodness of the predictions using the concept of entropy from information theory:

\[- \sum_{i = 1}^n w_i \log (p_{iy_i})\]

Warning

It is not recommended to use the entropy criterion for validation or out-of-sample evaluation, as the score can be extremely negative if even a single point in the new data has a predicted probability close to zero (due to the logarithm in the calculation).

Note: entropy is equivalent to the log loss and other related loss functions.
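
A minimal sketch of this calculation (illustrative only; the clipping away from zero is an assumption added to avoid the extreme values described in the warning above, not part of the documented behavior):

```julia
# Weighted entropy (log loss), with probabilities clipped away from zero.
function entropy_loss(y, p, w; clip=1e-15)
    -sum(w[i] * log(max(p[i, y[i]], clip)) for i in eachindex(y))
end
```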

AUC

Info

Can only be applied to classification problems with $K=2$ classes.

Argument name: :auc

The area under the ROC curve is a holistic measure of quality that gives the probability that the model ranks a randomly chosen positive instance higher than a randomly chosen negative one.
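
This probabilistic interpretation can be computed directly by comparing all positive/negative pairs, as in the following sketch (illustrative only; names are hypothetical, and ties are counted as half):

```julia
# AUC as the fraction of positive/negative pairs ranked correctly by the
# predicted probability `p_pos` of the positive class.
function pairwise_auc(y, p_pos, positive_label)
    pos = p_pos[y .== positive_label]
    neg = p_pos[y .!= positive_label]
    total = 0.0
    for a in pos, b in neg
        total += a > b ? 1.0 : a == b ? 0.5 : 0.0
    end
    total / (length(pos) * length(neg))
end
```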

Threshold-based

Info

Can only be applied to classification problems with $K=2$ classes.

Argument names:

  • :sensitivity
  • :specificity
  • :positive_predictive_value
  • :negative_predictive_value
  • :accuracy
  • :f1score

The threshold-based criteria are a family of values based on the confusion matrix and are calculated based on the number of true/false positives and negatives in the prediction. The label to treat as the positive class must be supplied using the positive_label keyword argument. The threshold keyword argument allows you to specify the threshold that determines the prediction cutoff (if not specified it defaults to 0.5, i.e. predict the class with the highest probability).
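
For illustration, the following sketch (hypothetical helper functions, not the library API) shows how the confusion-matrix counts at a given threshold lead to these criteria:

```julia
# Confusion-matrix counts at `threshold`, where `p_pos` is the predicted
# probability of `positive_label` for each point.
function confusion_counts(y, p_pos, positive_label; threshold=0.5)
    pred_pos   = p_pos .>= threshold
    actual_pos = y .== positive_label
    (tp = sum(pred_pos .& actual_pos),
     fp = sum(pred_pos .& .!actual_pos),
     fn = sum(.!pred_pos .& actual_pos),
     tn = sum(.!pred_pos .& .!actual_pos))
end

sensitivity(c) = c.tp / (c.tp + c.fn)   # true positive rate
specificity(c) = c.tn / (c.tn + c.fp)   # true negative rate
```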

Hinge Loss (Classification)

Info

Can only be applied to classification problems with $K=2$ classes.

Argument names:

  • :l1hinge
  • :l2hinge

The L1 hinge loss criterion scores the quality of predictions using a linear function that goes to zero once the classification of a sample is "sufficiently correct":

\[\sum_{i = 1}^n \max(0, 1 - \hat{y_i} \cdot d_i)\]

where $\hat{y_i}$ is $-1$ if $y_i$ is the first class and $+1$ if it is the second class, and $d_i$ is the raw output of the model's decision function, such that

\[P(\hat{y_i} = -1) = \frac{1}{1 + e ^ {d_i}}\]

The L2 hinge loss criterion is a variant of the L1 hinge loss that uses a squared loss instead of linear:

\[\sum_{i = 1}^n \frac{1}{2} \max(0, 1 - \hat{y_i} \cdot d_i) ^ 2\]
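
A minimal sketch of both losses (illustrative only; `yhat` contains the $\pm 1$ labels and `d` the raw decision-function outputs):

```julia
l1_hinge(yhat, d) = sum(max(0, 1 - yhat[i] * d[i]) for i in eachindex(d))
l2_hinge(yhat, d) = sum(0.5 * max(0, 1 - yhat[i] * d[i])^2 for i in eachindex(d))
```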

Regression

The regression criteria use the following notation:

  • $f_i$ is the predicted value for the $i$th point
  • $y_i$ is the true value for the $i$th point

Mean-squared error

Argument name: :mse

The mean-squared error measures the (weighted) mean of squared errors between the fitted and true values:

\[\frac{\sum_{i = 1}^n w_i \cdot (f_i - y_i)^2}{\sum_{i = 1}^n w_i}\]

Note that when evaluated with score, the result for this criterion is exactly $R^2$, due to the way scores are scaled as described at the top of this page.
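
A minimal sketch of the raw (unscaled) criterion, purely for illustration:

```julia
# Weighted mean-squared error between fitted values `f` and true values `y`.
mse(y, f, w) = sum(w[i] * (f[i] - y[i])^2 for i in eachindex(y)) / sum(w)
```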

Tweedie

Argument name: :tweedie

The Tweedie distribution is often used in insurance applications, where the total claim amount for a covered risk usually has a continuous distribution on positive values, except for the possibility of being exactly zero when the claim does not occur. The Tweedie distribution models the total claim amount as the sum of a Poisson number of independent Gamma variables. The Tweedie criterion measures the log-likelihood of the predicted values against this distribution:

\[\sum_{i = 1}^n w_i \left( - \frac{y_i e^{(1 - \rho) \log(f_i)}}{1 - \rho} + \frac{e^{(2 - \rho) \log(f_i)}}{2 - \rho} \right)\]

where $\rho$ is given by the tweedie_variance_power parameter (between 1 and 2) which characterizes the distribution of the responses. Set closer to 2 to shift towards a gamma distribution and set closer to 1 to shift towards a Poisson distribution.
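
The following sketch computes this quantity directly, using the equivalent power form $f_i^{1-\rho} = e^{(1-\rho)\log(f_i)}$ (illustrative only; `rho` stands in for the tweedie_variance_power parameter):

```julia
function tweedie_loss(y, f, w, rho)
    sum(w[i] * (-y[i] * f[i]^(1 - rho) / (1 - rho) + f[i]^(2 - rho) / (2 - rho))
        for i in eachindex(y))
end
```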

Hinge Loss (Regression)

Argument names:

  • :l1hinge
  • :l2hinge

The hinge loss criteria for regression problems score the quality of predictions using an adjusted absolute loss that goes to zero if the error of the prediction is lower than $\epsilon$ (given by the hinge_epsilon parameter).

The L1 hinge loss simply uses this adjusted absolute error:

\[\sum_{i = 1}^n \max(0, |f_i - y_i| - \epsilon)\]

The L2 hinge loss criterion squares the adjusted absolute error:

\[\sum_{i = 1}^n \frac{1}{2} \max(0, |f_i - y_i| - \epsilon) ^ 2\]
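
A minimal sketch of both regression hinge losses (illustrative only; `epsilon` stands in for the hinge_epsilon parameter):

```julia
l1_hinge_reg(y, f, epsilon) =
    sum(max(0, abs(f[i] - y[i]) - epsilon) for i in eachindex(y))
l2_hinge_reg(y, f, epsilon) =
    sum(0.5 * max(0, abs(f[i] - y[i]) - epsilon)^2 for i in eachindex(y))
```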

Prescription

The prescription criteria use the following notation:

  • $T_i$ is the observed treatment for the $i$th point
  • $y_i$ is the observed outcome for the $i$th point (corresponding to treatment $T_i$)
  • $f_i(t)$ is the predicted outcome for the $i$th point under treatment $t$
  • $z_i$ is the predicted best treatment to prescribe for point $i$, i.e., the treatment $t$ with the best $f_i(t)$

Combined Performance

Argument name: :combined_performance

It is shown empirically in the Optimal Prescriptive Trees paper that focusing only on prescription quality may limit the quality of results in prescription problems. To address this, the combined performance criterion balances the quality of the prescriptions being made (the first term below) against the accuracy of the predicted outcomes against those outcomes that were actually observed (the second term below):

\[\mu \left( \sum_{i = 1}^n w_i f_i(z_i) \right) + (1 - \mu) \left( \sum_{i = 1}^n w_i (f_i(T_i) - y_i) ^ 2 \right)\]

where $\mu$ is given by the prescription_factor parameter (between 0 and 1) and controls the tradeoff between the prescriptive and predictive components of the score.
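
For illustration, the formula translates directly into the following sketch (hypothetical names; `f` is a function `f(i, t)` giving the predicted outcome for point `i` under treatment `t`, and `mu` stands in for prescription_factor):

```julia
function combined_performance(f, z, T, y, w, mu)
    prescriptive = sum(w[i] * f(i, z[i]) for i in eachindex(y))          # outcome under prescribed treatment
    predictive   = sum(w[i] * (f(i, T[i]) - y[i])^2 for i in eachindex(y))  # fit to observed outcomes
    mu * prescriptive + (1 - mu) * predictive
end
```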

Prescription Outcome

Argument name: :prescription_outcome

Identical to Combined Performance with prescription_factor set to 1, so it only considers optimizing the expected outcome of the prescriptions.

Prediction Accuracy

Argument name: :prediction_accuracy

Identical to Combined Performance with prescription_factor set to 0, so it only considers matching the predicted outcomes under the observed treatments to the observed outcomes.

Policy

The policy criteria use the following notation:

  • $r_i(t)$ is the reward for the $i$th point under treatment $t$
  • $z_i$ is the treatment prescribed for point $i$

Reward

Argument name: :reward

The reward criterion seeks to optimize the total reward under the treatments prescribed by the policy:

\[\sum_{i = 1}^n w_i r_i(z_i)\]

Survival

The survival criteria use the following notation:

  • $t_i$ is the time of the last observation for the $i$th point
  • $\delta_i \in \{0,1\}$ indicates whether the last observation for the $i$th point was a death ($\delta_i = 1$) or a censoring ($\delta_i = 0$)
  • $\Lambda(t)$ is the cumulative hazard function for the training data, found using the Nelson-Aalen estimator
  • $\theta_i$ is the fitted hazard coefficient for the $i$th point
  • $f_i(t)$ is the fitted Kaplan-Meier curve estimate for the survival distribution of the $i$th point
  • $g_i(t)$ is the fitted Kaplan-Meier curve estimate for the censoring distribution of the $i$th point

Local Full Likelihood

Argument name: :localfulllikelihood

The local full likelihood is the objective function used by LeBlanc and Crowley and assumes that the survival distribution for each observation is a function of the cumulative hazard function $\Lambda$ and a point-wise adjustment $\theta_i$:

\[\mathbb{P}(\text{survival time} \leq t) = 1 - e^{-\theta_i \Lambda(t)}\]

We then measure the log-likelihood of this survival function against the data:

\[\sum_{i = 1}^n w_i \biggl( \Lambda(t_i) \theta_i - \delta_i \left[ \log (\Lambda(t_i)) + \log (\theta_i) + 1 \right] \biggr)\]

Log Likelihood

Argument name: :loglikelihood

The log likelihood criterion uses the fitted survival curve for each point to calculate the log likelihood of the data:

\[\sum_{i = 1}^n -w_i \log(1 - f_i(t_i))\]

Integrated Brier Score

Argument name: :integratedbrier

The integrated Brier score is a variant of the Brier score for classification (see Gini Impurity) that was adapted by Graf et al. to deal with censored data:

\[\sum_{i = 1}^n w_i \biggl( \int_{0}^{t_i} \frac{(1 - f_i(t))^2}{g_i(t)} dt + \delta_i \int_{t_i}^{t_{\max}} \frac{f_i(t)^2}{g_i(t_i)} dt \biggr)\]

Harrell's c-statistic

Argument name: :harrell_c_statistic

Harrell's c-statistic is the most widely used measure of goodness of fit for survival models, and is analogous to the AUC for classification problems.

Mathematically, the score gives the probability that the model assigns a higher probability of survival to the patient who actually lived longer (i.e. the patient with the better survival outlook is correctly identified by the model) when presented with a random pair of patients from the dataset in question. Therefore, like AUC, a simple baseline of random guessing will get a score of 0.5, while a perfect model will achieve a score of 1.0.
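
One common way to compute a concordance statistic of this kind, shown as an illustrative sketch only (not necessarily the exact definition used by the library), ranks patients by a risk score where higher risk means worse predicted survival, and treats a pair as comparable only when the earlier observed time is a death:

```julia
function harrell_c(t, delta, risk)
    concordant, comparable = 0.0, 0
    for i in eachindex(t), j in eachindex(t)
        if t[i] < t[j] && delta[i] == 1     # i died before j's last observation
            comparable += 1
            concordant += risk[i] > risk[j] ? 1.0 : risk[i] == risk[j] ? 0.5 : 0.0
        end
    end
    concordant / comparable
end
```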

For more information on concordance statistics and evaluation for survival problems, refer to the vignette on concordance from the R survival package.

Classification Criteria at Specific Times

Argument names:

  • :auc
  • :sensitivity
  • :specificity
  • :positive_predictive_value
  • :negative_predictive_value
  • :accuracy
  • :f1score

It is also possible to evaluate the performance of a survival model at a particular point in time by treating it as a classification problem and using one of the classification criteria listed above.

Given a time at which to evaluate the model (specified using the evaluation_time parameter), we first construct a binary classification task from the actual data in the following way:

  • Dead by evaluation time: deaths that occur before or at the evaluation time
  • Alive after evaluation time: deaths that occur after the evaluation time, and censored observations that occur at or after the evaluation time
  • Unknown/Ignored: censored observations before the evaluation time are excluded from the evaluation because their status at the evaluation time is unknown

We then use the survival model to predict the probability that each observation survives past the evaluation time, and evaluate these predictions using the specified classification criterion.
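
The label construction described above can be sketched as follows (illustrative only; the function name is hypothetical, and `missing` marks observations excluded from the evaluation):

```julia
# Binary labels at `evaluation_time` from observed times `t` and death
# indicators `delta` (1 = death, 0 = censoring).
function binary_survival_labels(t, delta, evaluation_time)
    map(eachindex(t)) do i
        if delta[i] == 1 && t[i] <= evaluation_time
            true        # dead by the evaluation time
        elseif delta[i] == 1 || t[i] >= evaluation_time
            false       # known to be alive after the evaluation time
        else
            missing     # censored before the evaluation time: status unknown
        end
    end
end
```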

Note that if you are using a threshold-based criterion, you can also control the prediction cutoff using the threshold parameter in the normal way.

Imputation

Imputation is an unsupervised task, so there is no concept of predictions matching a target from a scoring perspective. Instead, we use criteria that measure the similarity between datasets after imputation:

  • during training, we measure the similarity between the imputed datasets at each iteration
  • during validation, we artificially censor known values, impute them, and measure the similarity of the imputed values to the known true values

We use the following notation:

  • $\mathcal M$ is the set of all $(i,j)$ where the $j$th feature of the $i$th point is being imputed
  • $x_{ij}$ is the known value of the $j$th feature of the $i$th point (the imputed value from the previous iteration during training, and the value before censoring during validation)
  • $f_{ij}$ is the imputed value for the $j$th feature of the $i$th point

Euclidean/L2 Distance

Argument name: :l2

The Euclidean distance calculates the (weighted) sum of squared differences between the imputed values and their known true values:

\[\sum_{(i, j) \in \mathcal M} w_i (f_{ij} - x_{ij}) ^ 2\]
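
A minimal sketch of this formula (illustrative only; `M` is a collection of `(i, j)` index pairs marking the imputed entries):

```julia
l2_distance(M, x, f, w) = sum(w[i] * (f[i, j] - x[i, j])^2 for (i, j) in M)
```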

Manhattan/L1 Distance

Argument name: :l1

The Manhattan distance is the same as the Euclidean criterion, except it uses the absolute distance rather than the squared distance:

\[\sum_{(i, j) \in \mathcal M} w_i |f_{ij} - x_{ij}|\]