Scoring Criteria
There are various scoring criteria that you can use for training, tuning, and evaluating the models. The criterion to use in the relevant functions is specified by passing the indicated Symbol as the criterion argument.
When evaluating performance using the score function, the results are scaled so that a larger score is better, the baseline model receives a score of 0, and a perfect model would receive a score of 1. In this sense, the results from score are similar to $R^2$ for regression problems, where 0 indicates the model is no better than predicting the mean, and 1 is a perfect prediction for each point.
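As a rough illustration of this convention, the following Julia sketch scales a loss-style criterion against a trivial baseline. It assumes a perfect model attains zero raw loss; this is an assumption made for illustration, not a description of the internal implementation.

```julia
# Illustrative sketch of the score scaling convention (an assumption, not the
# library's internal implementation): compare the model's raw loss against the
# loss of a trivial baseline, assuming a perfect model would attain zero loss.
scaled_score(model_loss, baseline_loss) = 1 - model_loss / baseline_loss

# baseline -> 0, perfect -> 1; for :mse with a mean-predicting baseline this
# reduces to R^2 = 1 - SS_res / SS_tot.
```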
In the definitions below, $w_i$ is the sample weight assigned to the $i$th point (where $i = 1, \ldots, n$).
The available criteria depend on the problem type.
Classification
The classification criteria use the following notation:
- $K$ is the number of class labels
- $f_i$ is the predicted label for the $i$th point (assumed to be a value between 1 and $K$)
- $p_{ik}$ is the predicted probability that the $i$th point takes the $k$th label
- $y_i$ is the true label for the $i$th point
Misclassification
Argument name: :misclassification
Misclassification calculates the (weighted) proportion of points that are misclassified:
\[\frac{\sum_{i = 1}^n w_i \cdot \mathbb{I}(f_i \neq y_i)}{\sum_{i = 1}^n w_i}\]
where $\mathbb{I}(x)$ denotes the indicator function (1 if $x$ is true and 0 otherwise).
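For concreteness, a short Julia sketch of this formula (a hypothetical helper, not the library implementation), where f are the predicted labels, y the true labels, and w the sample weights:

```julia
# Weighted misclassification rate, directly from the formula above.
function misclassification(f, y, w)
    sum(w[i] * (f[i] != y[i]) for i in eachindex(y)) / sum(w)
end

misclassification([1, 2, 2], [1, 2, 1], [1.0, 1.0, 2.0])  # 0.5
```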
Gini Impurity
Argument name: :gini
The gini impurity measures how well the fitted probabilities match the empirical probability distribution:
\[\sum_{i = 1}^n w_i (1 - p_{iy_i})\]
Note: the gini impurity is equivalent to treating the output as a continuous value and calculating the mean-squared error. It is also equivalent to the Brier score. In particular, the Brier score decomposes into refinement and calibration components:
- refinement is related to the ordering of the predictions (similar to AUC)
- calibration relates to the reliability of the predicted probabilities
This means the gini impurity can often be a better measure of performance than AUC, as AUC only evaluates the ordering of the predictions. Additionally, many methods cannot train with AUC as the objective function but can use the gini impurity, so if AUC is the end metric, using gini as the training criterion can often improve results.
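A minimal Julia sketch of the formula above (a hypothetical helper), assuming P is an n×K matrix of predicted probabilities and y holds integer labels in 1:K:

```julia
# Weighted gini impurity: the penalty grows as the predicted probability of
# the true label shrinks.
function gini_impurity(P, y, w)
    sum(w[i] * (1 - P[i, y[i]]) for i in eachindex(y))
end
```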
Entropy
Argument name: :entropy
The entropy criterion measures the goodness of the predictions using the concept of entropy from information theory:
\[- \sum_{i = 1}^n w_i \log (p_{iy_i})\]
It is not recommended to use the entropy criterion for validation or out-of-sample evaluation, as the score can be extremely negative if even a single point in the new data has a predicted probability close to zero (due to the logarithm in the calculation).
Note: entropy is equivalent to the log loss and other related loss functions.
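A corresponding Julia sketch (hypothetical helper, same conventions as the gini example above):

```julia
# Weighted entropy / log loss; a predicted probability near zero for the true
# label makes this sum blow up, matching the caution above.
function entropy_criterion(P, y, w)
    -sum(w[i] * log(P[i, y[i]]) for i in eachindex(y))
end
```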
AUC
Can only be applied to classification problems with $K=2$ classes.
Argument name: :auc
The area under the curve is a holistic measure of quality that gives the probability that the model will rank a randomly chosen positive instance higher than a randomly chosen negative one.
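This pairwise interpretation can be computed directly. The Julia sketch below (a hypothetical helper, quadratic in the sample size rather than an efficient implementation) counts, over all positive/negative pairs, how often the positive point receives the higher predicted probability, with ties counted as one half:

```julia
# AUC as a pairwise ranking probability. `p` is the predicted probability of
# the positive class and `is_pos` marks the truly positive points.
function auc(p, is_pos)
    pos = findall(is_pos)
    neg = findall(!, is_pos)
    total = 0.0
    for i in pos, j in neg
        total += p[i] > p[j] ? 1.0 : p[i] == p[j] ? 0.5 : 0.0
    end
    return total / (length(pos) * length(neg))
end
```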
Threshold-based
Can only be applied to classification problems with $K=2$ classes.
Argument names:
:sensitivity
:specificity
:positive_predictive_value
:negative_predictive_value
:accuracy
:f1score
The threshold-based criteria are a family of values based on the confusion matrix, calculated from the number of true/false positives and negatives in the prediction. The label to treat as the positive class must be supplied using the positive_label keyword argument. The threshold keyword argument allows you to specify the threshold that determines the prediction cutoff (if not specified, it defaults to 0.5, i.e. predict the class with the highest probability).
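The sketch below illustrates how these quantities follow from the confusion matrix (a hypothetical helper; the 0.5 default mirrors the threshold keyword described above, and counting an exact tie with the threshold as a positive prediction is an assumption of this sketch):

```julia
# Confusion-matrix criteria for a binary problem. `p` is the predicted
# probability of `positive_label` and `y` the true labels.
function threshold_criteria(p, y, positive_label; threshold=0.5)
    pred_pos = p .>= threshold
    true_pos = y .== positive_label
    tp = sum(pred_pos .& true_pos)
    fp = sum(pred_pos .& .!true_pos)
    fn = sum(.!pred_pos .& true_pos)
    tn = sum(.!pred_pos .& .!true_pos)
    (sensitivity               = tp / (tp + fn),
     specificity               = tn / (tn + fp),
     positive_predictive_value = tp / (tp + fp),
     negative_predictive_value = tn / (tn + fn),
     accuracy                  = (tp + tn) / (tp + fp + fn + tn),
     f1score                   = 2tp / (2tp + fp + fn))
end
```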
Hinge Loss (Classification)
Can only be applied to classification problems with $K=2$ classes.
Argument names:
:l1hinge
:l2hinge
The L1 hinge loss criterion scores the quality of predictions using a linear function that goes to zero once the classification of a sample is "sufficiently correct":
\[\sum_{i = 1}^n \max(0, 1 - \hat{y_i} \cdot d_i)\]
where $\hat{y_i}$ is -1 if $y_i$ is the first class, and +1 if $y_i$ is the second class, and $d_i$ is the raw output of the model's decision function, such that
\[P(\hat{y_i} = -1) = \frac{1}{1 + e ^ {d_i}}\]
The L2 hinge loss criterion is a variant of the L1 hinge loss that uses a squared loss instead of linear:
\[\sum_{i = 1}^n \frac{1}{2} \max(0, 1 - \hat{y_i} \cdot d_i) ^ 2\]
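Both losses are straightforward to write out. A Julia sketch (hypothetical helpers), where d holds the raw decision-function outputs and yhat the ±1 label encoding:

```julia
# L1 and L2 hinge losses for binary classification, following the formulas above.
l1hinge(yhat, d) = sum(max(0, 1 - yhat[i] * d[i]) for i in eachindex(d))
l2hinge(yhat, d) = sum(0.5 * max(0, 1 - yhat[i] * d[i])^2 for i in eachindex(d))
```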
Regression
The regression criteria use the following notation:
- $f_i$ is the predicted value for the $i$th point
- $y_i$ is the true value for the $i$th point
Mean-squared error
Argument name: :mse
The mean-squared error measures the (weighted) mean of squared errors between the fitted and true values:
\[\frac{\sum_{i = 1}^n w_i \cdot (f_i - y_i)^2}{\sum_{i = 1}^n w_i}\]
Note that the result for this criterion is actually the $R^2$ due to the way scores are scaled as described at the top of this page.
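A Julia sketch of the raw criterion and the resulting scaled score (hypothetical helpers), where the baseline predicts the weighted mean of y so that the scaled value is exactly $R^2$:

```julia
# Weighted mean-squared error and its scaled score.
function weighted_mse(f, y, w)
    sum(w[i] * (f[i] - y[i])^2 for i in eachindex(y)) / sum(w)
end

function mse_score(f, y, w)
    ybar = sum(w .* y) / sum(w)                       # weighted-mean baseline
    1 - weighted_mse(f, y, w) / weighted_mse(fill(ybar, length(y)), y, w)
end
```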
Tweedie
Argument name: :tweedie
The Tweedie distribution is often used in insurance applications, where the total claim amount for a covered risk usually has a continuous distribution on positive values, except for the possibility of being exactly zero when the claim does not occur. The Tweedie distribution models the total claim amount as the sum of a Poisson number of independent Gamma variables. The Tweedie criterion measures the log-likelihood of the predicted values against this distribution:
\[\sum_{i = 1}^n w_i \left( - \frac{y_i e^{(1 - \rho) \log(f_i)}}{1 - \rho} + \frac{e^{(2 - \rho) \log(f_i)}}{2 - \rho} \right)\]
where $\rho$ is given by the tweedie_variance_power parameter (between 1 and 2) which characterizes the distribution of the responses. Set closer to 2 to shift towards a gamma distribution and set closer to 1 to shift towards a Poisson distribution.
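Using the equivalent form $e^{(1 - \rho) \log(f_i)} = f_i^{1 - \rho}$, a Julia sketch of this criterion (a hypothetical helper, with rho standing in for tweedie_variance_power):

```julia
# Tweedie log-likelihood criterion, assuming positive predictions `f`.
function tweedie_criterion(f, y, w, rho)
    sum(w[i] * (-y[i] * f[i]^(1 - rho) / (1 - rho) + f[i]^(2 - rho) / (2 - rho))
        for i in eachindex(y))
end
```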
Hinge Loss (Regression)
Argument names:
:l1hinge
:l2hinge
The hinge loss criteria for regression problems score the quality of predictions using an adjusted absolute loss that goes to zero if the error of the prediction is lower than $\epsilon$ (given by the hinge_epsilon parameter).
The L1 hinge loss simply uses this adjusted absolute error:
\[\sum_{i = 1}^n \max(0, |f_i - y_i| - \epsilon)\]
The L2 hinge loss criterion squares the adjusted absolute error:
\[\sum_{i = 1}^n \frac{1}{2} \max(0, |f_i - y_i| - \epsilon) ^ 2\]
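A Julia sketch of both variants (hypothetical helpers, with eps standing in for hinge_epsilon):

```julia
# Regression hinge losses: no penalty while the absolute error is within eps.
l1hinge_reg(f, y, eps) = sum(max(0, abs(f[i] - y[i]) - eps) for i in eachindex(y))
l2hinge_reg(f, y, eps) = sum(0.5 * max(0, abs(f[i] - y[i]) - eps)^2 for i in eachindex(y))
```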
Prescription
The prescription criteria use the following notation:
- $T_i$ is the observed treatment for the $i$th point
- $y_i$ is the observed outcome for the $i$th point (corresponding to treatment $T_i$)
- $f_i(t)$ is the predicted outcome for the $i$th point under treatment $t$
- $z_i$ is the predicted best treatment to prescribe for point $i$, i.e., the treatment $t$ with the best $f_i(t)$
Combined Performance
Argument name: :combined_performance
It is shown empirically in the Optimal Prescriptive Trees paper that focusing only on prescription quality may limit the quality of results in prescription problems. To address this, the combined performance criterion balances the quality of the prescriptions being made (the first term below) against how closely the predicted outcomes match the outcomes that were actually observed (the second term below):
\[\mu \left( \sum_{i = 1}^n w_i f_i(z_i) \right) + (1 - \mu) \left( \sum_{i = 1}^n w_i (f_i(T_i) - y_i) ^ 2 \right)\]
where $\mu$ is given by the prescription_factor parameter (between 0 and 1) and controls the tradeoff between the prescriptive and predictive components of the score.
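A Julia sketch of the formula (a hypothetical helper), assuming the predicted outcomes $f_i(t)$ are stored as an n×m matrix pred with one column per treatment, z holds the prescribed treatments, T the observed treatments, and mu stands in for prescription_factor:

```julia
# Combined performance: prescriptive term plus predictive (squared-error) term.
function combined_performance(pred, z, T, y, w, mu)
    prescriptive = sum(w[i] * pred[i, z[i]] for i in eachindex(y))
    predictive   = sum(w[i] * (pred[i, T[i]] - y[i])^2 for i in eachindex(y))
    return mu * prescriptive + (1 - mu) * predictive
end
```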
Prescription Outcome
Argument name: :prescription_outcome
Identical to Combined Performance with prescription_factor set to 1, so it only considers optimizing the expected outcome of the prescriptions.
Prediction Accuracy
Argument name: :prediction_accuracy
Identical to Combined Performance with prescription_factor set to 0, so it only considers matching the predicted outcomes under the observed treatments with the observed outcomes.
Policy
The policy criteria use the following notation:
- $r_i(t)$ is the reward for the $i$th point under treatment $t$
- $z_i$ is the treatment prescribed for point $i$
Reward
Argument name: :reward
The reward criterion seeks to optimize the total reward under the treatments prescribed by the policy:
\[\sum_{i = 1}^n w_i r_i(z_i)\]
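A Julia sketch (a hypothetical helper), assuming the rewards $r_i(t)$ are stored as an n×m matrix with one column per treatment:

```julia
# Total weighted reward under the prescribed treatments `z`.
reward_criterion(r, z, w) = sum(w[i] * r[i, z[i]] for i in eachindex(z))
```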
Survival
The survival criteria use the following notation:
- $t_i$ is the last observation for the $i$th point
- $\delta_i \in \{0,1\}$ indicates whether the last observation for the $i$th point was a death ($\delta_i = 1$) or a censoring ($\delta_i = 0$)
- $\Lambda(t)$ is the cumulative hazard function for the training data, found using the Nelson-Aalen estimator
- $\theta_i$ is the fitted hazard coefficient for the $i$th point
- $f_i(t)$ is the fitted Kaplan-Meier curve estimate for the survival distribution of the $i$th point
- $g_i(t)$ is the fitted Kaplan-Meier curve estimate for the censoring distribution of the $i$th point
Local Full Likelihood
Argument name: :localfulllikelihood
The local full likelihood is the objective function used by LeBlanc and Crowley and assumes that the survival distribution for each observation is a function of the cumulative hazard function $\Lambda$ and a point-wise adjustment $\theta_i$:
\[\mathbb{P}(\text{survival time} \leq t) = 1 - e^{-\theta_i \Lambda(t)}\]
We then measure the log-likelihood of this survival function against the data:
\[\sum_{i = 1}^n w_i \biggl( \Lambda(t_i) \theta_i - \delta_i \left[ \log (\Lambda(t_i)) + \log (\theta_i) + 1 \right] \biggr)\]
Log Likelihood
Argument name: :loglikelihood
The log likelihood criterion uses the fitted survival curve for each point to calculate the log likelihood of the data:
\[\sum_{i = 1}^n -w_i \log(1 - f_i(t_i))\]
Integrated Brier Score
Argument name: :integratedbrier
The integrated Brier score is a variant of the Brier score for classification (see Gini Impurity) that was adapted by Graf et al. to deal with censored data:
\[\sum_{i = 1}^n w_i \biggl( \int_{0}^{t_i} \frac{(1 - f_i(t))^2}{g_i(t)} dt + \delta_i \int_{t_i}^{t_{\max}} \frac{f_i(t)^2}{g_i(t_i)} dt \biggr)\]
Harrell's c-statistic
Argument name: :harrell_c_statistic
Harrell's c-statistic is the most widely-used measure of goodness-of-fit for survival models, and is analogous to the AUC for classification problems.
Mathematically, the score gives the probability that the model assigns a higher probability of survival to the patient that actually lived longer (i.e. the patient with better survival outlook is correctly identified by the model) when presented with a random pair of patients from the dataset in question. Therefore, like AUC, a simple baseline of random guessing will get a score of 0.5, while a perfect model will achieve a score of 1.0.
For more information on concordance statistics and evaluation for survival problems, refer to the vignette on concordance from the R survival package.
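As a rough illustration of the pairwise definition (not the library's implementation, and ignoring some tie-handling refinements discussed in that vignette), a Julia sketch that treats a pair as comparable when the earlier observation is a death, and as concordant when the model assigns that patient the worse survival outlook:

```julia
# Harrell's c-statistic under right censoring. `risk` is any score where
# higher means worse predicted survival, `t` the observed times, and `d`
# the death indicators.
function harrell_c(risk, t, d)
    concordant, comparable = 0.0, 0
    n = length(t)
    for i in 1:n, j in 1:n
        if t[i] < t[j] && d[i] == 1          # i died before j was last seen
            comparable += 1
            concordant += risk[i] > risk[j] ? 1.0 :
                          risk[i] == risk[j] ? 0.5 : 0.0
        end
    end
    return concordant / comparable
end
```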
Classification Criteria at Specific Times
Argument names:
:auc
:sensitivity
:specificity
:positive_predictive_value
:negative_predictive_value
:accuracy
:f1score
It is also possible to evaluate the performance of a survival model at a particular point in time by treating it as a classification problem and using one of the classification criteria listed above.
Given a time at which to evaluate the model (specified using the evaluation_time parameter), we first construct a binary classification task from the actual data in the following way:
- Dead by evaluation time: deaths that occur before or at the evaluation time
- Alive after evaluation time: deaths that occur after the evaluation time, and censored observations that occur at or after the evaluation time
- Unknown/Ignored: censored observations before the evaluation time are excluded from the evaluation because their status at the evaluation time is unknown
We then use the survival model to predict the probability that each observation survives past the evaluation time, and evaluate these predictions using the specified classification criterion.
Note that if you are using a threshold-based criterion, you can also control the prediction cutoff using the threshold parameter in the normal way.
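The label construction can be sketched directly in Julia (a hypothetical helper). Following the rules above, it returns the indices of the usable observations together with their binary labels (true meaning dead by the evaluation time):

```julia
# Build the binary task at `evaluation_time`: keep all deaths and any point
# still under observation at that time; drop points censored earlier.
function survival_classification_labels(t, died, evaluation_time)
    keep   = [i for i in eachindex(t) if died[i] || t[i] >= evaluation_time]
    labels = [died[i] && t[i] <= evaluation_time for i in keep]
    return keep, labels
end
```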
Imputation
Imputation is an unsupervised task, so from a scoring perspective there is no target for the predictions to match. Instead, we use criteria to measure the similarity between datasets after imputation:
- during training, we measure the similarity between the imputed datasets at each iteration
- during validation, we artificially censor values, impute them, and measure the similarity to the known true values
We use the following notation:
- $\mathcal M$ is the set of all $(i,j)$ where the $j$th feature of the $i$th point is being imputed
- $x_{ij}$ is the known value of the $j$th feature of the $i$th point (the imputed value from the previous iteration during training, and the value before censoring during validation)
- $f_{ij}$ is the imputed value for the $j$th feature of the $i$th point
Euclidean/L2 Distance
Argument name: :l2
The Euclidean distance calculates the squared distance between all imputed values and their known-true values:
\[\sum_{(i, j) \in \mathcal M} w_i (f_{ij} - x_{ij}) ^ 2\]
Manhattan/L1 Distance
Argument name: :l1
The Manhattan distance is the same as the Euclidean criterion, except it uses the absolute distance rather than the squared distance:
\[\sum_{(i, j) \in \mathcal M} w_i |f_{ij} - x_{ij}|\]
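Both distances can be sketched in Julia (hypothetical helpers), assuming M is a vector of (i, j) index pairs being imputed, F the imputed data matrix, and X the matrix of known values:

```julia
# Squared (L2) and absolute (L1) distances between imputed and known values.
l2_distance(F, X, M, w) = sum(w[i] * (F[i, j] - X[i, j])^2 for (i, j) in M)
l1_distance(F, X, M, w) = sum(w[i] * abs(F[i, j] - X[i, j]) for (i, j) in M)
```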