Interpretable Clustering of Credit Card Data

In this example, we showcase how Optimal Trees can be used to solve clustering problems in an interpretable way. Clustering is an unsupervised learning task that aims to segment a set of observations into similar groups based on the available data. Traditional clustering methods, such as k-means, assign points to clusters based on a distance metric, without providing any description of the clusters or of the reasons why points are assigned to them. We present two ways in which Optimal Trees can be used to identify meaningful clusters in an interpretable way.

  • The first method, Interpretable K-Means Clustering, developed in Chapter 20 of Machine Learning Under a Modern Optimization Lens (2019), consists of interpreting the results of a traditional clustering model, such as k-means, by fitting a multi-class Optimal Classification Tree to predict the cluster number assigned to each sample in our dataset by the clustering model. The resulting tree will thus provide, without any manual intervention, both interpretations of the clusters and insights into the way points are assigned to these clusters.
  • The second method, Guided Clustering, can be used in cases where we know in advance a specific measure by which we would like to cluster our samples. For instance, when clustering credit card holders based on usage activity, clusters driven by customer demographics would be more useful in a marketing application, while clusters driven by credit responsibility would be more useful in a risk management application. Depending on the desired application, we can then run our clustering process guided by this specific measure by reframing the problem as a traditional supervised prediction problem. In the resulting tree model, each leaf contains a bucket of samples that behave similarly under our selected measure, thus providing directly interpretable clusters.

Throughout this example, we will use a credit card dataset summarizing the usage behavior of 9,000 credit card holders in 17 features. This dataset is openly available on Kaggle.

Data Preparation

We start off by importing the dataset.

using CSV, DataFrames
df = CSV.read("credit_card_kaggle.csv", DataFrame)
select!(df, Not(:CUST_ID))
8950×17 DataFrame
  Row │ BALANCE     BALANCE_FREQUENCY  PURCHASES  ONEOFF_PURCHASES  INSTALLMEN ⋯
      │ Float64     Float64            Float64    Float64           Float64    ⋯
──────┼─────────────────────────────────────────────────────────────────────────
    1 │   40.9007            0.818182      95.4               0.0              ⋯
    2 │ 3202.47              0.909091       0.0               0.0
    3 │ 2495.15              1.0          773.17            773.17
    4 │ 1666.67              0.636364    1499.0            1499.0
    5 │  817.714             1.0           16.0              16.0              ⋯
    6 │ 1809.83              1.0         1333.28              0.0
    7 │  627.261             1.0         7091.01           6402.63
    8 │ 1823.65              1.0          436.2               0.0
  ⋮   │     ⋮               ⋮              ⋮             ⋮                     ⋱
 8944 │    5.87171           0.5           20.9              20.9              ⋯
 8945 │  193.572             0.833333    1012.73           1012.73
 8946 │   28.4935            1.0          291.12              0.0
 8947 │   19.1832            1.0          300.0               0.0
 8948 │   23.3987            0.833333     144.4               0.0              ⋯
 8949 │   13.4576            0.833333       0.0               0.0
 8950 │  372.708             0.666667    1093.25           1093.25
                                                13 columns and 8935 rows omitted

The data has some missing values, which we impute using OptImpute:

imputer = IAI.ImputationLearner(method=:opt_knn, random_seed=1)
df = IAI.fit_transform!(imputer, df)
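
For example, one way to verify that no missing values remain after the imputation is to count the missing entries in each column:

# Count the remaining missing entries across all columns; this should be zero
sum(count(ismissing, col) for col in eachcol(df))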

Once the missing data is imputed, we scale the data to have zero mean and unit variance so that the clustering is not dominated by differences in scale across columns.

using Statistics
X = Matrix(df)
X_norm = (X .- mean(X, dims=1)) ./ std(X, dims=1)

Interpretable K-Means Clustering

This first method consists of two steps:

  • applying a standard clustering algorithm, in this case k-means, to our dataset, in order to output a vector assigning each sample to a cluster number;
  • fitting an Optimal Classification Tree to predict this vector, and analyzing the resulting model.

K-Means Clustering

For the purposes of this case, we will pick $k = 4$ clusters. We first apply the k-means algorithm to our scaled dataset, and compute the cluster assignment as a vector y.

using StableRNGs, ParallelKMeans

rng = StableRNG(1)
k = 4
clustering_result = kmeans(X_norm', k; rng=rng)
y = clustering_result.assignments
8950-element Vector{Int64}:
 3
 1
 4
 3
 3
 4
 2
 4
 3
 3
 ⋮
 1
 4
 3
 3
 4
 4
 4
 3
 3

The clustering defined by k-means is difficult to interpret. Each cluster is represented by its centroid in 17 dimensions (one for each feature), and each point is assigned to the cluster with the closest centroid. Therefore, in order to understand why a point was assigned to a given cluster, we also have to understand why it was not assigned to any other cluster, which requires making sense of the 17-dimensional centroids.
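
To see this concretely, one way to inspect the raw k-means output is to collect the centroids into a table (using clustering_result.centers, which stores one centroid per column in the scaled feature space):

# Each centroid is a 17-dimensional vector of scaled feature values, one row per cluster
centroids = DataFrame(clustering_result.centers', names(df))

Reading off four rows of 17 scaled values is far less direct than following the splits of a tree.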

Option 1: Optimal Classification Tree

We can use Optimal Trees to understand which features drive the division of samples into different clusters. To do this, we train an Optimal Classification Tree with the cluster assignments given by the k-means algorithm as the target variable:

grid = IAI.GridSearch(
    IAI.OptimalTreeClassifier(random_seed=1, minbucket=50),
    max_depth=2:4,
)
IAI.fit!(grid, df, y)
Optimal Trees Visualization

We can now look at the tree to understand how points are assigned to clusters. Note that the intensity of the color of the leaf (e.g. dark red vs. light red) represents the proportion of points in each leaf that have the predicted class.

The tree first splits on PURCHASES_FREQUENCY: samples with lower purchase frequencies (left side of the tree) tend to fall into clusters 1 and 3, while those with higher frequencies (right side of the tree) tend to fall into clusters 1, 2, and 4.

When the purchase frequency is lower, the tree uses BALANCE, CASH_ADVANCE, and CASH_ADVANCE_TRX to determine whether the customer is in cluster 1 or 3. For example, those with lower balances and cash advances tend to be in cluster 3.

Similarly, when the purchase frequency is higher, CASH_ADVANCE, CASH_ADVANCE_TRX, and PURCHASES are used to determine the cluster. For example, cluster 2 tends to contain the customers with the highest purchase amounts.
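
We can also corroborate this reading programmatically, for example by extracting the fitted learner from the grid and inspecting its variable importance (a quick sketch):

# Extract the best learner found by the grid search and rank the features it uses
lnr = IAI.get_learner(grid)
IAI.variable_importance(lnr)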

A brief analysis of this relatively simple tree thus allows us to draw the following conclusions on our four clusters:

  • Cluster 1 aggregates heavy but infrequent users: despite not making purchases frequently, they maintain high balances and cash advances.
  • Cluster 2 aggregates heavy users: they make purchases relatively frequently and have high purchase amounts.
  • Cluster 3 aggregates light and cautious users: they make purchases relatively infrequently and have low balances and low cash advances.
  • Cluster 4 aggregates light but frequent users: despite making purchases frequently, their purchase amounts are low and they take fewer cash advances.

We can calculate the accuracy of this tree in predicting the clusters.

IAI.score(grid, df, y, criterion=:misclassification)
0.9249162011173184

With a simple application of Optimal Trees, we were thus able to obtain both precise insights into our four clusters and the paths that lead a sample to each of those clusters. Given its high accuracy, this tree could now be used as a clustering model in its own right, serving as an interpretable version of our previous k-means model.
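
For instance, a minimal sketch of using the tree itself to assign customers to clusters, in place of the k-means model, would be:

# Cluster assignments produced by the tree rather than by k-means
tree_clusters = IAI.predict(grid, df)

Unlike the k-means assignments, each of these labels comes with the explicit decision path that produced it.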

Option 2: Optimal Policy Tree

Instead of treating the cluster assignment as a labeled classification problem, another approach would be to give the model the relative distance to each of the cluster centroids, and use Optimal Policy Trees to assign points to clusters in order to minimize the average distance to the assigned cluster.

Let's first calculate the distance to each centroid and use this as the reward matrix.

n = size(df, 1)
rewards = zeros(n, k)
for i in 1:n
  for j in 1:k
    # Euclidean distance from the scaled sample to the j-th cluster centroid
    centroid = clustering_result.centers[:, j]
    rewards[i, j] = sqrt(sum(abs2, X_norm[i, :] .- centroid))
  end
end
rewards = DataFrame(rewards, Symbol.(1:k))
8950×4 DataFrame
  Row │ 1        2         3        4
      │ Float64  Float64   Float64  Float64
──────┼─────────────────────────────────────
    1 │ 5.05604   9.45417  1.24565  3.34653
    2 │ 2.63242   9.17259  3.58534  4.9135
    3 │ 5.40308   7.90728  4.16504  3.44633
    4 │ 4.38792   8.57097  1.83591  3.6516
    5 │ 4.82516   9.37676  1.42728  3.43788
    6 │ 4.85841   7.93035  2.81228  2.2223
    7 │ 8.83973   4.61903  8.49642  6.57509
    8 │ 5.27382   8.38233  3.41028  1.99441
  ⋮   │    ⋮        ⋮         ⋮        ⋮
 8944 │ 6.76995  10.6567   4.43031  5.69278
 8945 │ 6.28939   9.69875  4.31911  5.08049
 8946 │ 7.02789   9.88848  5.32942  4.69652
 8947 │ 6.83982   9.90472  5.13935  4.71792
 8948 │ 6.73615  10.0087   4.79318  4.673
 8949 │ 6.23375  10.6477   4.28184  5.59323
 8950 │ 6.32234   9.46181  5.00377  5.3574
                           8935 rows omitted

Now we can train an Optimal Policy Tree to minimize this distance.

policy_grid = IAI.GridSearch(
    IAI.OptimalTreePolicyMinimizer(random_seed=1, minbucket=50),
    max_depth=2:4,
)
IAI.fit!(policy_grid, df, rewards)
Optimal Trees Visualization

We can similarly interpret the clusters defined by the trees:

  • Cluster 1 aggregates frequent users who are not the biggest spenders: they typically have high balances, credit limits, or cash advances, but not high purchase amounts.
  • Cluster 2 aggregates heavy users: they make purchases relatively frequently and have high purchase amounts.
  • Cluster 3 aggregates light and cautious users: they make purchases relatively infrequently and have low balances and low cash advances.
  • Cluster 4 aggregates light but frequent users: despite making purchases frequently, their purchase amounts are low and they take fewer cash advances.

Note that these definitions are largely similar to what we see from the Optimal Classification Tree result, but with a smaller tree (19 nodes vs. 23), the results are somewhat easier to interpret.

If we take the cluster with the best reward as the predicted cluster, we can also calculate the accuracy of this tree:

pred_clusters = parse.(Int, IAI.predict_treatment_rank(policy_grid, df)[:, 1])
accu = sum(pred_clusters .== y) / length(y)
0.9207821229050279

We obtain accuracy comparable to the Optimal Classification Tree approach. At the same level of performance, the smaller size of the Optimal Policy Tree suggests that it is more efficient at separating the clusters. This is not surprising: under the Optimal Policy Tree approach, the model is given not just the closest centroid for each point but its relative distance to all centroids, which helps it learn the separation between clusters better.
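
To see more precisely where the two approaches agree, we could also cross-tabulate the policy tree assignments against the k-means labels (a sketch using the variables defined above, with rows indexing the k-means clusters and columns the policy tree clusters):

# Build a k × k agreement matrix between k-means labels and policy tree assignments
agreement = zeros(Int, k, k)
for (kmeans_label, tree_label) in zip(y, pred_clusters)
    agreement[kmeans_label, tree_label] += 1
end
agreement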

Guided Clustering

Let's now consider the case where a bank specifically wants to segment its clients based on their credit responsibility. As a proxy for credit responsibility, we can focus on the feature :PRC_FULL_PAYMENT, the percentage of full payments, since there tends to be a correlation between making full payments and using credit responsibly.

We therefore train an Optimal Regression Tree to predict this feature from all of the other features in the dataset. We introduce X_guided and y_guided to train our tree on.

X_guided = select(df, Not(:PRC_FULL_PAYMENT))
y_guided = df.PRC_FULL_PAYMENT

grid_guided = IAI.GridSearch(
    IAI.OptimalTreeRegressor(random_seed=1),
    max_depth=2:2,
)
IAI.fit!(grid_guided, X_guided, y_guided)
Optimal Trees Visualization

We obtain a tree which clusters the data into four buckets, each of them with different levels of full payments. Additionally, the structure of the tree gives us insights into how other features are used to form these clusters.

  • Node 3 contains customers with a medium percentage of full payments, and these customers have lower purchases and low balances.
  • Node 4 includes customers with the lowest percentage of full payments. These customers are characterized by low purchases but high balances.
  • Node 6 has customers with the highest percentage of full payments, characterized by high purchases and low minimum payments.
  • Node 7 contains customers with a low percentage of full payments, with the customers characterized by higher minimum payments and high purchases.
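
Since each leaf of this tree acts as a cluster, we can also recover the cluster membership of every customer directly from the fitted learner; one possible sketch, using the leaf index as the cluster label:

# The index of the leaf each customer falls into serves as its guided cluster label
lnr_guided = IAI.get_learner(grid_guided)
guided_clusters = IAI.apply(lnr_guided, X_guided)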

We can calculate the R-squared of this Optimal Regression Tree.

IAI.score(grid_guided, X_guided, y_guided, criterion=:mse)
0.35557192856194064

Despite only splitting the data into four clusters, the resulting clustering explains a reasonable proportion of the variance in the dependent variable, as reflected in the R-squared.
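
As a sanity check on this value, we could also compute the R-squared by hand from the tree's predictions, assuming the :mse criterion reports performance relative to a constant-mean baseline:

# R-squared: 1 minus the tree's squared error relative to predicting the mean
pred_guided = IAI.predict(grid_guided, X_guided)
1 - sum(abs2, y_guided .- pred_guided) / sum(abs2, y_guided .- mean(y_guided))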

It is interesting to note that in this guided clustering example, our tree aggregates samples into four clusters using features related to credit availability, whereas our unsupervised interpretable clustering trees picked up more general behavioral characteristics. When looking to cluster data, it is therefore worth considering whether we want the clusters to be guided by a particular measure that is relevant to the intended application.