Predicting Risk of Loan Default

In this example, we use the dataset from the FICO Explainable Machine Learning Challenge to compare the performance of Optimal Trees to XGBoost, and also compare the interpretability of the resulting trees to other approaches for model explainability (LIME and SHAP).

The challenge used a dataset of home equity line of credit (HELOC) applications made by real homeowners. A HELOC is a line of credit typically offered by a bank as a percentage of home equity (the difference between the current market value of a home and the outstanding balance on loans secured against it). The customers in this dataset have requested a credit line, and the fundamental task is to use the information about the applicant in their credit report to predict whether they will repay their HELOC account within 2 years. This prediction is then used to decide whether the homeowner qualifies for a line of credit and, if so, how much credit should be extended.

Due to the financial nature of the dataset, FICO posed the challenge of generating high-quality, explainable predictions using this data. When a credit decision is made, the consumer is entitled by regulations to an explanation for any adverse decisions, and thus model interpretability is a must-have for any AI solution to be used in this space.


In this example, we do not focus on tuning either method for performance. We only conduct some very rough parameter tuning to obtain reasonable models for interpretation.

Data Preparation

Read in the data, treating -8 (No Usable/Valid Trade or Inquiry) and -9 (No Record) as missing:

using CSV, DataFrames
df = DataFrame(CSV.File("heloc.csv", missingstrings=["-9", "-8"]))
10459×24 DataFrame. Omitted printing of 21 columns
│ Row   │ RiskPerformance │ ExternalRiskEstimate │ MSinceOldestTradeOpen │
│       │ String          │ Int64?               │ Union{Missing, Int64} │
│ 1     │ Bad             │ 55                   │ 144                   │
│ 2     │ Bad             │ 61                   │ 58                    │
│ 3     │ Bad             │ 67                   │ 66                    │
│ 4     │ Bad             │ 66                   │ 169                   │
│ 5     │ Bad             │ 81                   │ 333                   │
│ 6     │ Bad             │ 59                   │ 137                   │
│ 7     │ Good            │ 54                   │ 88                    │
│ 10452 │ Good            │ 79                   │ 133                   │
│ 10453 │ Good            │ 90                   │ 197                   │
│ 10454 │ Bad             │ 75                   │ 410                   │
│ 10455 │ Good            │ 73                   │ 131                   │
│ 10456 │ Bad             │ 65                   │ 147                   │
│ 10457 │ Bad             │ 74                   │ 129                   │
│ 10458 │ Bad             │ 72                   │ 234                   │
│ 10459 │ Bad             │ 66                   │ 28                    │

Delete 588 rows where all features are missing:

inds = findall(i -> !all(ismissing.(values(df[i, 2:end]))), 1:nrow(df))
df = df[inds, :]
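
For readers following along in Python (which this example calls into later via PyCall.jl), the two cleaning steps above can be sketched with the standard library alone. The tiny inline table and the helper name `load_rows` are made up for illustration; only the sentinel codes -8 and -9 come from the text.

```python
import csv
import io

MISSING_CODES = {"-8", "-9"}  # sentinel codes named in the text above


def load_rows(text):
    """Parse CSV text, mapping sentinel codes to None and dropping rows
    whose feature columns (everything after the first) are all missing."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    rows = []
    for raw in reader:
        label, features = raw[0], raw[1:]
        cleaned = [None if v in MISSING_CODES else int(v) for v in features]
        if all(v is None for v in cleaned):
            continue  # no usable information in this row
        rows.append([label] + cleaned)
    return header, rows


# Hypothetical three-row sample, not taken from the real dataset.
sample = """RiskPerformance,ExternalRiskEstimate,MSinceOldestTradeOpen
Bad,55,144
Good,-9,-9
Bad,-8,58
"""
header, rows = load_rows(sample)
print(rows)
```

The row that is missing everything is dropped, while the row with a single missing feature is kept with that value marked as `None`.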

There are two ordinal features in the data, which we will tidy up by converting numeric codes to strings and then encoding as an ordered CategoricalVector:

df.MaxDelq2PublicRecLast12M = map(df.MaxDelq2PublicRecLast12M) do x
    if x == 0
        "Derogatory Comment"
    elseif x == 1
        "120+ Days Delinquent"
    elseif x == 2
        "90 Days Delinquent"
    elseif x == 3
        "60 Days Delinquent"
    elseif x == 4
        "30 Days Delinquent"
    elseif x == 7
        "Never Delinquent"
    else
        missing  # assumption: the fallback branch is absent from this copy; other codes are left missing
    end
end
df.MaxDelqEver = map(df.MaxDelqEver) do x
    if x == 2
        "Derogatory Comment"
    elseif x == 3
        "120+ Days Delinquent"
    elseif x == 4
        "90 Days Delinquent"
    elseif x == 5
        "60 Days Delinquent"
    elseif x == 6
        "30 Days Delinquent"
    elseif x == 8
        "Never Delinquent"
    else
        missing  # assumption: the fallback branch is absent from this copy; other codes are left missing
    end
end

for col in (:MaxDelq2PublicRecLast12M, :MaxDelqEver)
  df[!, col] = CategoricalVector(df[!, col], ordered=true)
  levels!(df[!, col], [
      "Never Delinquent",
      "30 Days Delinquent",
      "60 Days Delinquent",
      "90 Days Delinquent",
      "120+ Days Delinquent",
      "Derogatory Comment",
  ])
end

There are two numeric columns with an additional special value -7 (Condition not Met). We will treat these as mixed numeric/categoric features:

for colname in [:MSinceMostRecentDelq, :MSinceMostRecentInqexcl7days]
    col = replace(df[!, colname], -7 => "Condition not Met")
    df[!, colname] = IAI.make_mixed_data(col)
end

With all manipulation complete, we split data into training and test sets:

X = df[:, 2:end]
y = df[:, 1]
seed = 100
(train_X, train_y), (test_X, test_y) = IAI.split_data(:classification, X, y,
                                                      seed=seed)

Optimal Trees

Our first approach to predicting loan defaults is using Optimal Trees. We fit an optimal tree model, validating over the depth of the tree.

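# The learner arguments and depth range are assumed; the originals are missing from this copy.
grid = IAI.GridSearch(IAI.OptimalTreeClassifier(random_seed=seed), max_depth=1:6)
IAI.fit_cv!(grid, train_X, train_y, validation_criterion=:auc, n_folds=5)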
IAI.set_display_label!(IAI.get_learner(grid), "Bad")
Optimal Trees Visualization

We can evaluate the out-of-sample performance:

IAI.score(grid, test_X, test_y, criterion=:misclassification)
IAI.score(grid, test_X, test_y, criterion=:auc)

Model Interpretation

The tree visualization can easily be inspected to examine the transparent logic behind the predictions made by the tree. You can also examine the path through the tree for any point using the print_path function. Here, we look at the explanation for the first point:

IAI.print_path(grid.lnr, test_X, 1)
Rules used to predict sample 1:
  1) Split: ExternalRiskEstimate (=67) < 75.5 or is missing
    2) Split: ExternalRiskEstimate (=67) < 68.5 or is missing
      3) Split: MSinceMostRecentInqexcl7days (=0) in [Condition not Met] or < 0.5
        4) Predict: Bad (P(Bad) = 80.83%), [1657,393], 2050 points, error 0.1917

We can see that the first sample is predicted to have an 81% risk of being a bad loan because:

  1. the external risk score estimate of 67 is below the thresholds of 75.5 and 68.5, increasing the risk each time;
  2. the number of months since the most recent inquiry was 0, below the threshold of 0.5, increasing the risk;
  3. there are 2050 loans in the training set similar to this given point, based on satisfying these criteria, and 1657 (81%) of these were bad.

We can summarize these conditions to conclude that applicants with an external risk score of 68 or lower and less than one month since the last credit inquiry have an 81% chance of being a bad loan. This is a succinct, straightforward and understandable justification for the prediction being made, and only uses two of the features in the dataset to make this choice.
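
The numbers reported at the leaf can be checked by hand. The short Python sketch below (Python being the language this example calls into later) recomputes the leaf probability and error from the class counts printed by `print_path`; the variable names are our own.

```python
# Class counts at the leaf, copied from the print_path output: [Bad, Good]
n_bad, n_good = 1657, 393
n_total = n_bad + n_good

p_bad = n_bad / n_total        # fraction of training points labeled Bad
leaf_error = n_good / n_total  # misclassification rate if the leaf predicts Bad

print(n_total)                 # 2050
print(round(100 * p_bad, 2))   # 80.83
print(round(leaf_error, 4))    # 0.1917
```

These match the `2050 points`, `P(Bad) = 80.83%` and `error 0.1917` values shown in the printed path.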

XGBoost with LIME/SHAP

Next, we are going to use XGBoost to make predictions on the data, and then use LIME and SHAP to generate explanations for these predictions.

Data Preparation

LIME and SHAP are implemented in Python, so we are going to call these from Julia using PyCall.jl, which makes it easy to access Python libraries inside Julia.

XGBoost does not support the mixed-data encoding we used for Optimal Trees, so we need to re-encode the mixed data. First, we undo the mixed data transformation so that we can pass the data to Python:

for col in (:MSinceMostRecentDelq, :MSinceMostRecentInqexcl7days)
  df[!, col] = IAI.undo_mixed_data(df[!, col])
end

We will also convert the ordinal variables back to numbers so that XGBoost can work with them directly:

df.MaxDelq2PublicRecLast12M = map(df.MaxDelq2PublicRecLast12M) do x
  if ismissing(x)
    missing  # assumption: the original value is absent from this copy; missing is passed through
  elseif x == "Derogatory Comment"
    0
  elseif x == "120+ Days Delinquent"
    1
  elseif x == "90 Days Delinquent"
    2
  elseif x == "60 Days Delinquent"
    3
  elseif x == "30 Days Delinquent"
    4
  elseif x == "Never Delinquent"
    7
  end
end
df.MaxDelqEver = map(df.MaxDelqEver) do x
  if ismissing(x)
    missing  # assumption: the original value is absent from this copy; missing is passed through
  elseif x == "Derogatory Comment"
    2
  elseif x == "120+ Days Delinquent"
    3
  elseif x == "90 Days Delinquent"
    4
  elseif x == "60 Days Delinquent"
    5
  elseif x == "30 Days Delinquent"
    6
  elseif x == "Never Delinquent"
    8
  end
end

As mentioned above, we will be doing the model fitting and interpretation in Python via PyCall.jl, so we will first convert the data to a Pandas DataFrame by saving the data to CSV and loading it into Pandas:

CSV.write("train_df.csv", df[train_y.indices[1], :])
CSV.write("test_df.csv", df[test_y.indices[1], :])

using PyCall
pd = pyimport("pandas")
function load_data(filename)
    df = pd.read_csv(filename)
    y = df.RiskPerformance
    X = df.drop(["RiskPerformance"], axis=1)
    return X, y
end

train_X, train_y = load_data("train_df.csv")
test_X, test_y = load_data("test_df.csv")

Now we handle the columns that were mixed data in Julia. We use the weight-of-evidence (WoE) encoding to represent the column with numeric values.

ce = pyimport("category_encoders")
encoder = ce.WOEEncoder(cols=["MSinceMostRecentInqexcl7days",
                              "MSinceMostRecentDelq"])
train_X = encoder.fit_transform(train_X, train_y == "Bad")
test_X = encoder.transform(test_X)
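
To make the weight-of-evidence idea concrete, here is a minimal from-scratch sketch in plain Python. It is not the `category_encoders` implementation, whose smoothing details may differ; it only illustrates the usual definition, where each distinct value of the column is treated as a category and replaced by the log of the ratio between its share among Bad loans and its share among Good loans. The function name `woe_table`, the `smoothing` parameter and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter


def woe_table(categories, is_bad, smoothing=1.0):
    """Return {category: WoE}, where WoE = ln(share of Bad in the category /
    share of Good in the category), with additive smoothing to avoid log(0)."""
    bad_counts = Counter(c for c, b in zip(categories, is_bad) if b)
    good_counts = Counter(c for c, b in zip(categories, is_bad) if not b)
    total_bad = sum(bad_counts.values())
    total_good = sum(good_counts.values())
    table = {}
    for c in set(categories):
        bad_share = (bad_counts[c] + smoothing) / (total_bad + 2 * smoothing)
        good_share = (good_counts[c] + smoothing) / (total_good + 2 * smoothing)
        table[c] = math.log(bad_share / good_share)
    return table


# Hypothetical toy column: "cnm" stands in for "Condition not Met".
values = ["cnm", "cnm", "cnm", "0", "0", "0", "0", "12"]
is_bad = [False, False, True, True, True, True, False, False]
table = woe_table(values, is_bad)
print({k: round(v, 3) for k, v in sorted(table.items())})
```

Values that occur mostly among Bad loans receive a positive encoding and values that occur mostly among Good loans receive a negative one, so the mixed column becomes a single numeric column that XGBoost can consume.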

Model Fitting

With the data prepared, we can fit an XGBoost model, validating over the depth:

xgboost = pyimport("xgboost.sklearn")
model_selection = pyimport("sklearn.model_selection")
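xgb = model_selection.GridSearchCV(
    # The estimator arguments are assumed; that line is missing from this copy.
    xgboost.XGBClassifier(random_state=seed),
    Dict("max_depth" => 2:8),
).fit(train_X.values, train_y)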

We can then evaluate the out-of-sample performance:

pred_y = xgb.predict(test_X.values)
metrics = pyimport("sklearn.metrics")
metrics.accuracy_score(test_y, pred_y)
metrics.roc_auc_score(Int.(test_y .== "Bad"),
                      xgb.predict_proba(test_X.values)[:, 1])
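
As a reminder of what these two metrics measure, the following plain-Python sketch computes them from scratch on a handful of made-up predictions. The helper names and the toy data are hypothetical, and the pairwise AUC computation is chosen for clarity rather than speed.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(is_positive, scores):
    """Probability that a random positive outscores a random negative,
    counting ties as one half."""
    pos = [s for s, p in zip(scores, is_positive) if p]
    neg = [s for s, p in zip(scores, is_positive) if not p]
    wins = sum(1.0 if sp > sn else 0.5 if sp == sn else 0.0
               for sp in pos for sn in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted probabilities of "Bad".
y_true = ["Bad", "Bad", "Good", "Good", "Bad"]
p_bad = [0.9, 0.4, 0.3, 0.6, 0.7]
y_pred = ["Bad" if p >= 0.5 else "Good" for p in p_bad]

acc = accuracy(y_true, y_pred)
auc = roc_auc([t == "Bad" for t in y_true], p_bad)
print(acc, round(auc, 4))
```

Accuracy depends on the 0.5 cutoff used to turn probabilities into labels, whereas AUC only depends on how well the scores rank Bad loans above Good ones.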

Model Interpretation

Now that we have a trained black-box model, we can investigate the approaches for explaining its predictions. We will focus on the same sample point used above with Optimal Trees:

xgb.predict_proba(test_X.values)[1, :]
2-element Array{Float32,1}:

We see that the predicted probability of default for this point under the XGBoost model is 90%, which is in line with the high risk of default predicted by the Optimal Trees model.


First we use LIME to explain the prediction for a single row at a time. LIME (local interpretable model-agnostic explanations) is an explanation technique that explains the predictions of a classifier by learning a linear model locally around the prediction.

Here we show the explanation for the same sample point as above:

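lime = pyimport("lime.lime_tabular")
# The arguments after each opening parenthesis are assumed; they are missing from this copy.
lime_explainer = lime.LimeTabularExplainer(train_X.values,
                                           feature_names=train_X.columns.tolist())
explanation = lime_explainer.explain_instance(test_X.iloc[1].values,
                                              xgb.predict_proba)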

LIME attempts to explain the local sensitivity of the prediction to the features in the dataset. Here it tells us that the most significant factor in the prediction is the external risk estimate, and increasing this value will push the prediction towards Good. The next three strongest factors in the local prediction are the time since the most recent delinquency, the time since the most recent inquiry (where increases in these two push the prediction towards Bad), and average age of the credit file (where increases push the prediction towards Good).

We see that the two features that were used by the tree prediction show up among the most significant factors, but many more than these two are included, complicating the interpretation. Additionally, the explanation that increasing the time since the most recent inquiry would increase the chance of this being a Bad loan does not seem to make sense. Intuitively, more recent inquiries for credit are seen as a sign of greater credit risk, which is reinforced by the optimal tree model identifying a short time since the most recent inquiry as a high credit risk.

Another limitation is that the LIME explanations are local in nature and therefore serve more as a sensitivity analysis than as an explanation for why the prediction was made in the first place.
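
The local nature of these explanations is easier to see in a stripped-down sketch of the underlying idea: sample points near the instance, weight them by proximity, and fit a weighted linear model to the black-box outputs. The plain-Python code below is a simplification under our own assumptions and not the `lime` package's algorithm (which, among other things, discretizes features and draws perturbations from training-data statistics); `black_box`, `local_linear_explanation` and all parameter values are hypothetical.

```python
import math
import random


def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def local_linear_explanation(predict, x0, n_samples=500, scale=0.5,
                             width=1.0, seed=0):
    """Fit a proximity-weighted linear model to `predict` around the point x0.

    Returns [intercept, coef_1, ..., coef_d], expressed in offsets from x0."""
    rng = random.Random(seed)
    d = len(x0)
    xtwx = [[0.0] * (d + 1) for _ in range(d + 1)]
    xtwy = [0.0] * (d + 1)
    for _ in range(n_samples):
        offset = [rng.gauss(0.0, scale) for _ in range(d)]
        z = [xi + oi for xi, oi in zip(x0, offset)]
        dist2 = sum(o * o for o in offset)
        w = math.exp(-dist2 / width ** 2)  # nearer samples count more
        row = [1.0] + offset
        y = predict(z)
        # Accumulate the weighted normal equations (X'WX) beta = X'Wy.
        for i in range(d + 1):
            xtwy[i] += w * row[i] * y
            for j in range(d + 1):
                xtwx[i][j] += w * row[i] * row[j]
    return solve(xtwx, xtwy)


def black_box(z):
    # Hypothetical stand-in for a fitted model: a smooth, nonlinear risk score.
    return 1.0 / (1.0 + math.exp(-(2.0 * z[0] - z[0] * z[1])))


coefs = local_linear_explanation(black_box, [0.0, 1.0])
print([round(c, 3) for c in coefs])
```

The coefficients describe how the score responds to small changes around this one point. Moving the point elsewhere can change their size and even their sign, which is why such an explanation reads more like a sensitivity analysis than a global rule.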


Next we will generate model explanations using SHAP. SHAP is an approach to explain the output of any machine learning model by assigning each feature in the data an importance value for every prediction made:

shap = pyimport("shap")
shap_explainer = shap.TreeExplainer(xgb.best_estimator_)
shap_values = shap_explainer.shap_values(test_X)
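
The quantity behind these importance values is the Shapley value. As a hedged illustration, the plain-Python sketch below computes exact Shapley values for a tiny hypothetical model by brute force over all feature orderings, using a single fixed baseline point. `TreeExplainer` uses a far more efficient tree-specific algorithm and a data-driven baseline, so this only shows the idea, not the library's method; `model`, `x` and `baseline` are made up.

```python
from itertools import permutations


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one point: average each feature's marginal
    contribution over every ordering in which features are switched from
    their baseline value to their value in x."""
    d = len(x)
    totals = [0.0] * d
    orderings = list(permutations(range(d)))
    for order in orderings:
        current = list(baseline)
        prev = predict(current)
        for i in order:
            current[i] = x[i]
            new = predict(current)
            totals[i] += new - prev
            prev = new
    return [t / len(orderings) for t in totals]


def model(z):
    # Hypothetical scoring function with an interaction between features 0 and 1.
    return 2.0 * z[0] + z[0] * z[1] - 0.5 * z[2]


x = [1.0, 2.0, 4.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(model, x, baseline)
print(phi)                                   # [3.0, 1.0, -2.0]
print(sum(phi), model(x) - model(baseline))  # 2.0 2.0
```

The values always sum to the difference between the prediction and the baseline prediction. Note that the interaction term `z[0] * z[1]` is split evenly between the two features involved, so an interaction is folded into per-feature numbers rather than reported as such.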

First we generate a summary plot that shows the overall importance of the features in the dataset according to the SHAP values:

shap.summary_plot(shap_values, test_X)

This shows us the relative importance of the features in the model overall, and we can see the features with more impact are generally also the ones that were used by the tree.

We can also look at the variation of the SHAP values as the value of a single feature in the dataset changes:

shap.dependence_plot("ExternalRiskEstimate", shap_values, test_X,
                     interaction_index="MSinceMostRecentInqexcl7days")

This plot shows us that the impact of the external risk estimate on the predicted outcome is largely linear in its value. From the color scale, the plot shows there is no interaction effect between the external risk estimate and the time since the most recent inquiry, as the trend in the plot is independent of color (despite the time affecting the external risk score). In comparison, by inspecting the tree's output, we see that there are signs of interactions between these two variables, as they are both used in different locations of the tree with different thresholds each time, indicating that each feature's importance to predictions might depend on the value of the other.

These previous plots were summarizing aggregate effects, showing how the features are related on average to the predictions. We can also use the SHAP values to investigate the prediction made for a single point (using the same single point as earlier for Optimal Trees and LIME):

shap.force_plot(shap_explainer.expected_value, shap_values[1, :],
                test_X.iloc[1], matplotlib=true)

We see from the SHAP explanation that the most significant feature pushing the prediction towards Bad is the average age of the credit file. This is followed by the external risk estimate and the time since last opening an account. There are over ten additional features with importances assigned in the explanation, pushing the prediction in both directions.

Note that the time since the most recent inquiry is not one of the most important factors. This is unlike in the LIME explanation where it was the third-most important feature, and in the optimal tree model where it was one of two relevant factors alongside the external risk estimate. This is again evidence that this approach to explanation may have trouble isolating the interactions between variables to succinctly interpret the model output.
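
When reading a force plot, it helps to remember that the contributions add up: the base value plus the per-feature SHAP values equals the model output for that point. The sketch below works through the arithmetic with made-up numbers, assuming the explainer reports values in log-odds units (which we believe is the default for tree ensembles such as XGBoost). Which class a positive value favors depends on how the labels were encoded, so treat the sign convention here as an assumption too.

```python
import math

# Hypothetical numbers for one prediction; real values would come from
# shap_explainer.expected_value and one row of shap_values.
base_value = -0.1
contributions = {
    "average age of credit file": 0.6,
    "external risk estimate": 0.5,
    "time since last account opened": 0.3,
    "all other features combined": 0.2,
}

margin = base_value + sum(contributions.values())  # model output in log-odds
probability = 1.0 / (1.0 + math.exp(-margin))      # logistic transform

print(round(margin, 3))
print(round(probability, 3))
```

Because the logistic transform is nonlinear, equal-sized contributions in log-odds do not correspond to equal-sized changes in probability, which is worth keeping in mind when comparing features in the plot.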


In this example, we compared two approaches for predicting loan defaults and interpreting the resulting model. The logic behind the optimal tree model and its predictions is transparent and clear. In contrast, the LIME and SHAP explanations assigned importance to features similar to those used by the tree, but also to many others, making it harder to construct a succinct description of the prediction mechanism. In the example prediction we attempted to explain, LIME assigned importance to both of the features used by the tree, but for one of these the effect was likely in the wrong direction, while SHAP did not assign much importance to this second feature.